In this Issue




Multimedia capability is rapidly becoming a standard feature in today's
workstations. In this issue we have nine articles that describe just such a
workstation, the HP 9000 Model 712. The Model 712 is an entry-level workstation
with high-performance features that make it an excellent platform for multimedia
tools and applications. The article on page 6 provides an overview of the Model
712, showing how the system is based on three VLSI chips: a multimedia-enhanced
PA-RISC processor (the PA 7100LC), a highly integrated I/O chip, and a
high-performance graphics chip.

Flawless execution of a product's development is not the only factor that ensures
a product's success. Defining the correct feature set and choosing the right design methodologies
are just as important as the schedule. The article on page 12 describes how the design team for the PA
7100LC processor used this philosophy to guide the design decisions they made in developing the CPU
chip, and the article on page 23 describes how these design decisions impacted the methodologies used
to create, verify, debug, and test the processor chip.

Low manufacturing cost was one of the main goals for the Model 712 workstation. The article on page 36
describes how the Model 712's I/O subsystem was designed with this goal in mind. The I/O chip,
called LASI (an acronym for the two major pieces of functionality on the chip, LAN and SCSI),
integrates several I/O functions on one chip. Both the LAN and SCSI designs were purchased from outside
vendors and imported into the HP IC design process at the artwork and netlist levels, respectively.

Besides performance and functionality, low manufacturing cost was also a primary goal for the graphics
chip described in the article on page 43. This was achieved by extracting as much performance and
functionality as possible from readily available technology and integrating components such as the color
lookup table and the frame buffer onto one chip. One of the features incorporated on the graphics chip
is a technology called HP Color Recovery, which is described in the article on page 51. Using a low-cost
8-bit frame buffer and HP Color Recovery, the graphics chip can display images that are in many cases
visually indistinguishable from those of a 24-bit frame buffer costing three times more.

The combination of software and hardware optimizations, including the implementation of a small set of
PA-RISC multimedia instructions, enables the video player in the HP MPower 2.0 product to play
back MPEG compressed video at real-time rates of up to 30 frames per second. As the article on page 60
explains, this is the first implementation in which real-time MPEG video decompression has been
achieved via software running on a general-purpose processor. The multimedia enhancements allow
four parallel operations per cycle by partitioning each of the 32-bit ALUs.

Integrating telephone capabilities on a workstation is a natural step in the evolution of the electronic 
office. The HP TeleShare option card for the Model 712 workstation, which is described in the article on 
page 69, represents HP's first telephony product. HP TeleShare provides two-line support, with each line 
configurable for voice, fax, or data. 

The product design for the Model 712, described in the article on page 75, shows how a design with no 
fasteners and using environmentally friendly materials and low-cost parts can provide excellent manu- 
facturability and customer ease of use. 

The PA 7100LC processor and the LASI chip are also used in a series of low-end multiuser business serv- 
ers, including the HP 9000 Series 800 Models E23, E35, E45, and E55 and the HP 3000 Series 908, 918, 928, 
and 938. The article on page 79 gives an overview of the architecture of these products and the process 
the development team went through to meet their time-to-market goals. 

In today's global economy users must have seamless access to applications and data that might be
thousands of miles from where they are located. The article on page 85 describes a tool called HP
Distributed Smalltalk, which provides an object-oriented environment for the rapid development and
deployment of multiuser, enterprise-wide distributed applications. Based on the object-oriented model,
HP Distributed Smalltalk contains the objects that enable developers to construct applications that
provide such things as easy access to information across the enterprise, dynamic interaction with other
users on the network, insulation from differences in operating environments, interoperability, and code
reuse. The article on page 93 describes an application that was built with HP Distributed Smalltalk. The
application, HP Software Solution Broker, is a client-server system that gives HP's worldwide technical
consultants easy access to the latest HP and non-HP software products and tools for customer
demonstrations and prototyping.

Two papers in this issue are from the 1994 HP Design Technology Conference, a forum for the exchange 
of ideas, best practices, and results among engineers involved in the development and application of 
integrated circuit design technologies. ► After trying techniques that did not provide enough information 
to track down the root cause of a failure in the FPALU of the PA 7100LC processor, the design team de- 
cided to use a methodology called voltage contrast imaging to find the problem. Voltage contrast imag- 
ing (page 102) allows visual tracking of logical level problems to their source on operating circuits using 
a scanning electron microscope. ► In many IC design centers today design for testability (DFT) is not just 
an abstract goal but a necessity. The article on page 107 describes how a design team faced with the 
need to test over twenty new ASIC components going into four different workstation and multiuser com- 
puter models formed a DFT team to develop a common system-level DFT architecture so that subsystem 
parts could be shared without affecting the manufacturing test flow. 

C.L. Leath
Associate Editor 



Cover 

An artistic rendition of the interconnection between the three main VLSI chips that make up the hard- 
ware architecture for the HP 9000 Model 712 workstation. The die photos are for the PA 7100LC proces- 
sor (top), the graphics chip (lower left), and the LASI chip (lower right). 



What's Ahead 

In the June issue we'll have ten articles on the design of the HP G1600A capillary electrophoresis
instrument, a new liquid-phase sample separation system for analytical chemists. We'll also have articles on
COBOL SoftBench, a product that encapsulates COBOL in the SoftBench development environment, HP 
Disk Array, a fault-tolerant mass storage solution for PC networks, and two more papers from the 1994 
HP Design Technology Conference. 

A Low-Cost, High-Performance
PA-RISC Workstation with Built-in 
Graphics, Multimedia, and Networking 
Capabilities 

Designing as a set the three VLSI components that provide the core 
functions of CPU, I/O, and graphics for the HP 9000 Model 712 work- 
station balanced performance and cost and simplified the interfaces 
between components, allowing designers to create a system with high 
performance at a low cost. 

by Roger A. Pearson 



Designing a workstation entails defining various functional
blocks to work together to provide a set of features at a
desired level of performance at the lowest possible cost. Often,
many parts of the design are leveraged from previous
designs, and only new functionality is designed from scratch.
This approach may save development costs, but could result
in a product that is more costly to build.

When one component of the system design has performance
that can't be taken advantage of, whether because of
architecture limitations or other components' performance
limitations, then the system design suffers by having to carry the
cost of that unused performance. By designing with the total
system in mind, so that all components of the design are
optimized to work together with no wasted performance,
cost can be minimized. The designers of the HP 9000 Series
700 Models 712/60 and 712/80 took this approach to offer a
high-performance combination of graphics, multimedia, and
networking capabilities at new low prices. The objectives of
the new design included:

• Providing the high performance of a PA-RISC workstation at
the lowest possible cost

• Improving the performance and capabilities of multimedia
functions through simple extensions to the instruction set

• Enabling an extensive set of communication features
through low-cost option cards

• Designing for high-volume manufacturing.

Instrumental in meeting these objectives was the decision to
design three new custom VLSI chips together, as a set, to
achieve new levels of price/performance for the core
functions of CPU, I/O, and graphics.

Overview 

Three new VLSI chips provide most of the functionality of the
Model 712 workstation. The PA 7100LC CPU chip interfaces
directly to the cache and main memory. The LASI (LAN/SCSI)
chip does most of the core I/O needed for entry-level
workstations. The graphics subsystem consists of the graphics
chip and the frame buffer VRAMs. All three chips communicate
through the GSC (general system connect) bus.
Fig. 1 shows a block diagram of the Model 712 system.

The Models 712/60 and 712/80 are very similar and differ 
only in their cache sizes and cache speeds and in the main 
system clock speeds. 

The Processor 

The compute power of the Model 712 system is provided by
the PA-RISC PA 7100LC processor,1,2 which is packaged in a
432-pin ceramic PGA. The CPU design was optimized for the
Model 712 and includes the following features:

• Superscalar CPU

• 1K-byte instruction buffer

• Multimedia support

• Cache control for up to 2M bytes of external cache

• ECC (error correction coding) memory controller.

The clock frequencies of the Model 712/60 and the Model 
712/80 are 60 MHz and 80 MHz respectively. The PA 7100LC 
is described in more detail in the article on page 12. 

Cache 

The PA 7100LC CPU uses an external cache. An external
cache allows system designers to change the size of the
cache easily to meet their performance and cost goals.
Furthermore, off-chip cache provides all the performance
necessary, without limiting the CPU frequency.

The external cache is 64K bytes on the Model 712/60 and
256K bytes on the Model 712/80 and is logically split into
equal halves for the instruction and data caches. Combining
the caches saved pins on the CPU. To further reduce costs,
industry-standard SRAMs (static RAMs) are used. Table I
shows the SRAMs used in the Model 712 systems.

Table I
Static RAMs Used in the Model 712 Systems

Model    Function   Size            Speed    Quantity
712/60   Tag        8K bytes        12 ns    4
712/60   Data       8K bytes        12 ns    6
712/60   Data       8K x 9 bits     12 ns    2
712/80   Tag        32K bytes       10 ns    4
712/80   Data       32K bytes       10 ns    6
712/80   Data       32K x 9 bits    10 ns    2
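
As a quick consistency check of Table I against the cache sizes quoted above, the data SRAM capacities can be summed for each model. The following is only a sketch under stated assumptions: the first three table rows are taken to belong to the Model 712/60 and the last three to the Model 712/80, and each x9 part is assumed to contribute one byte of data per location (the article does not say what the ninth bit is used for). The names are illustrative.

```python
K = 1024

# (locations per part, quantity) for the data SRAMs of each model.
# Assumption: every part supplies one byte of cache data per location.
DATA_SRAMS = {
    "712/60": [(8 * K, 6), (8 * K, 2)],
    "712/80": [(32 * K, 6), (32 * K, 2)],
}


def data_cache_bytes(model):
    """Total bytes of cache data storage provided by the data SRAMs."""
    return sum(depth * quantity for depth, quantity in DATA_SRAMS[model])


for model in DATA_SRAMS:
    total = data_cache_bytes(model)
    # The cache is logically split into equal instruction and data halves.
    print(model, total // K, "K bytes total,", total // (2 * K), "K bytes per half")
```

Under these assumptions the totals come to 64K bytes and 256K bytes, matching the cache sizes given for the two models.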



Main Memory 

The main memory for the Model 712 systems has been
engineered to provide high performance with industry-standard
70-ns SIMMs (single inline memory modules). Currently
supported SIMMs are available in 4M-, 8M-, 16M-, and 32M-byte
sizes. Four slots are available and must be filled in pairs for
a maximum of 128M bytes.
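
The population rule can be illustrated by enumerating the possible memory sizes. This is a minimal sketch, assuming SIMM sizes of 4M, 8M, 16M, and 32M bytes and assuming that the two SIMMs of a pair must be the same size (the article only says that slots are filled in pairs). The function name is hypothetical.

```python
from itertools import combinations_with_replacement

SIMM_SIZES_MB = (4, 8, 16, 32)  # assumed supported SIMM sizes, in M bytes
MAX_PAIRS = 2                   # four slots, filled two at a time


def memory_configurations():
    """Map each total size (M bytes) to the pair populations that produce it."""
    configs = {}
    for n_pairs in range(1, MAX_PAIRS + 1):
        for pairs in combinations_with_replacement(SIMM_SIZES_MB, n_pairs):
            # Assumption: both SIMMs in a pair have the same capacity.
            total = sum(2 * size for size in pairs)
            configs.setdefault(total, []).append(pairs)
    return configs


configs = memory_configurations()
for total in sorted(configs):
    print(total, "M bytes:", configs[total])
```

With two pairs of 32M-byte SIMMs the total reaches the 128M-byte maximum stated above.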

The Model 712's main memory design minimizes the average
cache miss penalty. The main memory controller returns
double words (eight bytes, since a word is four bytes) back
to the CPU. Each cache line is made up of four double
words. When there is a cache miss, the one double word of
the four in the cache line that was missed is referred to as
the critical word. To minimize the miss penalty, the double
word containing the critical word is sent back to the CPU
first, followed by the remaining three double words.
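
The critical-word-first policy can be sketched as a function that maps a miss address to the order in which the four double words of the line are returned. The article does not specify the order of the remaining three double words; the wraparound order used here is an assumption, and the names are illustrative.

```python
DOUBLE_WORD_BYTES = 8
LINE_DOUBLE_WORDS = 4
LINE_BYTES = DOUBLE_WORD_BYTES * LINE_DOUBLE_WORDS  # 32-byte cache line


def return_order(miss_address):
    """Double-word indices within the cache line, in the order returned."""
    critical = (miss_address % LINE_BYTES) // DOUBLE_WORD_BYTES
    # Assumption: after the critical double word, the rest follow in
    # wraparound order. Only "critical first" is stated in the article.
    return [(critical + i) % LINE_DOUBLE_WORDS for i in range(LINE_DOUBLE_WORDS)]


print(return_order(0x1010))  # the miss falls in double word 2 of its line
```

The CPU can resume as soon as the first double word arrives, rather than waiting for the whole line to be filled from its lowest address.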

Bandwidth is maximized by using fast page mode when
consecutive accesses reside on the same page. This is often the
case when large blocks of memory are accessed and is very
common in windowed graphics systems.

The General System Connect Bus 

The general system connect, or GSC, is the local bus that
connects the three VLSI devices and the optional I/O card.
The GSC bus is designed to provide maximum bandwidth
for memory-to-graphics transfers. The bus has 32-bit
multiplexed address and data lines to minimize the number of
signals. Other features of the bus include:

• Operation at half the CPU frequency (30 or 40 MHz)

• Support for 1-, 2-, 8-, 16-, or 32-byte transactions

• Central arbitration

• Parity generation and checking.
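
The relationship between CPU clock, bus clock, and bus width can be worked through numerically. The sketch below computes only a raw upper bound, assuming one 4-byte data transfer on every bus cycle; because address and data share the same lines and transactions carry overhead, sustained throughput would be lower. The function names are illustrative.

```python
BUS_WIDTH_BYTES = 4  # 32-bit multiplexed address/data lines


def gsc_clock_mhz(cpu_mhz):
    """The GSC bus runs at half the CPU frequency."""
    return cpu_mhz / 2


def raw_peak_mbytes_per_s(cpu_mhz):
    """Upper bound only: assumes a data transfer on every bus cycle."""
    return gsc_clock_mhz(cpu_mhz) * BUS_WIDTH_BYTES


for cpu_mhz in (60, 80):  # Model 712/60 and Model 712/80
    print(cpu_mhz, "MHz CPU:", gsc_clock_mhz(cpu_mhz), "MHz bus,",
          raw_peak_mbytes_per_s(cpu_mhz), "Mbytes/s raw peak")
```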

Normally, bus transactions are terminated by a turnaround
state that allows drivers to be turned off before the drivers
for the next transaction are turned on. To improve graphics
performance, the bus supports back-to-back writes to the
same device without the turnaround state. This improves
throughput on transfers of large blocks of data from main
memory to graphics.

During transfers from memory to I/O, it is sometimes
necessary to lock the CPU out of memory (e.g., when semaphores
are used). To facilitate this, the GSC bus provides a locking
mechanism, which prevents the CPU from accessing memory
(to service a cache miss, for example).

Graphics 

The graphics subsystem consists of a graphics chip and four
on-board VRAMs (video RAMs), which provide a 1024-by-768-pixel
frame buffer with a depth of eight planes at a refresh
rate of 72 Hz. An optional high-resolution VRAM board
increases resolution to 1280 by 1024 pixels.
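
The frame buffer figures translate directly into storage and display-rate requirements. The following sketch assumes the standard frame buffer is 1024 by 768 pixels with eight planes (one byte per pixel) and, as a further assumption not stated in the article, that the high-resolution option also uses eight planes. The function names are illustrative.

```python
def frame_buffer_bytes(width, height, planes):
    """Bytes of VRAM needed to hold one frame."""
    return width * height * planes // 8


def pixels_per_second(width, height, refresh_hz):
    """Displayed pixels per second, ignoring blanking intervals."""
    return width * height * refresh_hz


standard = frame_buffer_bytes(1024, 768, 8)
# Assumption: the optional high-resolution board keeps a depth of 8 planes.
high_res = frame_buffer_bytes(1280, 1024, 8)

print(standard // 1024, "K bytes for 1024 x 768")
print(high_res // 1024, "K bytes for 1280 x 1024")
print(pixels_per_second(1024, 768, 72), "pixels/s at 72 Hz")
```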

The graphics chip was designed with the other system
components to provide high performance at a minimal cost. For
more information on the graphics chip, see reference 3 and
the article on page 43.

Built-in I/O 

The Model 712 features a number of built-in I/O devices that
are intended to address the needs of the majority of users.

Support for these functions is provided largely by the LASI
I/O VLSI chip. LASI is a highly integrated chip that provides
a significant reduction in system cost and increased
reliability. The chip is packaged in a 240-pin MQUAD package. The
LASI chip is described in more detail in the article on page
36 and in reference 4.

The following sections briefly describe the LASI chip's
built-in capabilities.

IEEE 802.3 LAN. LASI contains an Intel 82C596 megacell,
which was ported to work with HP's IC process. The LAN
transceiver, which was not practical to include on LASI, is
loaded on the printed circuit board. The transceiver
interfaces to both the AUI (attachment unit interface) and
EtherTwist media.

SCSI. The Model 712 uses an 8-bit single-ended SCSI
interface for the optional internal hard drive and external
peripherals. The SCSI-2 interface is implemented entirely within
LASI through a megacell that was designed by HP and NCR.
A netlist for the NCR 53C710 was imported into HP's design
environment. The design was then tuned to work in HP's IC
process.

By keeping the SCSI bus stub length to a minimum on the
printed circuit board and on the connection to the optional
internal drive, SCSI termination on the internal side is
greatly simplified. Short stub lengths allow the bus to be
terminated on the printed circuit board, whether the
optional internal drive is present or not. This saves cost by
obviating the need for special terminators, which would
otherwise have to be enabled or disabled (manually or
electrically), depending on the presence or absence of the
optional internal drive.

Audio. 16-bit CD-quality audio playback and record capability
is provided by the audio circuitry, which consists of a Crystal
Semiconductor CS4216 CODEC and supporting circuitry. The
LASI chip also includes the serial interface to the CS4216.
Headphone, microphone, and line-in connectors are located
on the rear panel. Standard sampling rates include 8, 44.1,
and 48 kHz.

Real-Time Clock. A real-time clock is designed into the LASI
chip. Battery backup keeps time while the workstation is
powered down.

PS/2. There are two PS/2 connectors on the rear panel that 
allow connection to a low-cost industry-standard keyboard 
and mouse. The PS/2 interface circuitry is integrated into 
the LASI chip. 

RS-232. An RS-232 interface has also been designed into the
LASI chip. The Model 712 buffers the signals with a MAXIM
211 to provide an RS-232 serial port. LASI buffers inbound
and outbound data with 16-byte FIFOs, at baud rates from
50 to 454 kbits/s.

Parallel. The LASI chip also provides a parallel port conforming
to the Centronics industry standard.

Flexible Disk Support. A Western Digital WD37C65C flexible
disk controller interfaces LASI to an optional internal
personal-computer-style flexible disk drive.

Flash EPROM. An 8-bit bus on the LASI chip is demultiplexed
by two 74HCT374 latches to provide the address and data
lines necessary to address the two 128K-byte flash EPROMs
that contain the boot firmware. The flash EPROMs are also
used to store configuration parameters, eliminating the need
for an EEPROM and its associated cost.

I/O System Support. LASI provides a number of miscellaneous 
I/O system support functions, including: 

• Clock generation. LASI derives all the necessary clocks
required by the I/O circuitry from the main system clock. It
does so by using simple divide-by-n counters and two digital
phase-locked loops.

• System arbitration support. LASI arbitrates GSC bus
requests from the I/O devices within LASI, as well as from the
CPU and optional expansion card.

• Interrupt support. LASI also provides and manages external 
interrupt capability for the various I/O devices. 

Optional I/O 

For those users who need functionality beyond that provided
by the built-in I/O, the Model 712 includes two personality
slots that can be configured with a variety of other I/O
functions. The first of these slots is referred to as the expansion
slot and includes a connection to the GSC bus. The second
slot provides a connection to the serial audio stream, and
is intended for telephone functions. This slot is called the
telephony slot.

Expansion Cards. Expansion cards are optional cards that 
connect directly to the GSC bus to provide a variety of other 
I/O functions. 

Since LASI has a configurable address space and can be
configured as an arbitration slave, many of the expansion
cards rely on a second LASI chip to implement much of their
functionality.

The following optional expansion cards are provided for the
Model 712:

• Second serial port. The second serial port card uses its own
LASI chip and support circuitry identical to that on the
system board to provide an additional RS-232 port.

• Second LAN AUI and second serial interface. This card also
uses a LASI chip and circuitry similar to that on the system
board to add an additional IEEE 802.3 LAN with an
attachment unit interface (AUI) and a second RS-232 interface.

• X.25 and second serial interface. A Motorola 68302
multiprotocol processor interfaced to the 8-bit bus of a slave
LASI provides X.25 networking to a 25-pin X.21bis port for
speeds of 1.2 kbits/s to 19.2 kbits/s. The second RS-232
serial interface is implemented in the same fashion as the
other cards.

• Second display. A second display can be added to the
system with the second display card. This card duplicates the
graphics functionality that is already built into the system
board by replicating the graphics chip and its supporting
circuitry.

Fig. 2. Block diagram of the Model 712 audio and telephony circuits.

• Token Ring/9000. The Token Ring/9000 card provides IEEE
802.5 LAN functionality through the use of a Texas
Instruments token ring controller chip and a custom ASIC that
provides the GSC interface. Unshielded and shielded
twisted pair connections are provided at data rates from 4
Mbits/s to 16 Mbits/s.

• Second display and second LAN AUI/RS-232. This option
combines the features of the second graphics display and
the second LAN AUI/RS-232 options. Since the circuitry for
this option would not fit on a single expansion slot card,
some of the circuitry resides on a daughter card that is
connected to the expansion slot card. The daughter card gets
power and mechanical support through the telephony
connector, so when this option is installed, the telephony option
is not available.

Telephony. The telephony card installs in the telephony slot
and provides two lines of telephone access. Each of the
lines can be configured to support voice, data modem, or fax
modem.



The system board's headset and microphone serve as the
human interface for voice telephony, and an interface chip
on the telephony card called XBAR links the system board's
audio circuitry to the telephony functions (see Fig. 2).

This arrangement allows recording and playback during 
telephone conversations. It also supports digital mixing of 
microphone, line-in, telephone, and prerecorded audio. 
Caller-ID decoding is supported, as are DTMF (dual-tone 
multifrequency) encoding and decoding, and dual-line 
conferencing. 
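
As an illustration of what DTMF encoding involves, the sketch below generates the two-tone signal for a keypad key using the standard DTMF row and column frequencies. It is not the telephony card's DSP code, which the article does not show; the 8-kHz sample rate, the tone duration, and the function names are assumptions chosen for the example.

```python
import math

ROW_HZ = (697, 770, 852, 941)       # standard DTMF low-group frequencies
COL_HZ = (1209, 1336, 1477, 1633)   # standard DTMF high-group frequencies
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(key):
    """Return the (low, high) tone pair in Hz for a single keypad key."""
    if len(key) != 1:
        raise ValueError("key must be a single character")
    for r, row in enumerate(KEYPAD):
        c = row.find(key)
        if c >= 0:
            return ROW_HZ[r], COL_HZ[c]
    raise ValueError("not a DTMF key: %r" % key)


def dtmf_samples(key, sample_rate=8000, duration_s=0.05):
    """Sum of the two sine tones, scaled to stay within [-1.0, 1.0]."""
    low, high = dtmf_frequencies(key)
    n = round(sample_rate * duration_s)
    return [0.5 * (math.sin(2 * math.pi * low * t / sample_rate)
                   + math.sin(2 * math.pi * high * t / sample_rate))
            for t in range(n)]


print(dtmf_frequencies("5"), len(dtmf_samples("5")))
```

Decoding is the reverse problem: detecting which one low-group and one high-group frequency are present in the incoming samples.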

The XBAR chip serves to route information between the
LASI I/O chip, the audio CODEC, and the DSP blocks in a
variety of programmable ways. Data is transferred to and
from the system board through two serial data paths. Two
additional serial paths send and receive data to and from the
DSPs. Two 8-bit parallel ports are used by the DSPs during
the DSP boot process. XBAR has a few other functions,
including receiving incoming phone rings and controlling
phone line hook status.

Each DSP subsystem consists of an Analog Devices
ADSP2101 processor and 32K by 24 bits of external 20-ns
SRAM for DSP programs and data. Each processor has two
serial ports, one for XBAR and the other for the Analog
Devices AD28msp01 analog front end (phone CODEC). Each
phone CODEC connects to a standard two-wire telephone
line through a Silicon Systems Incorporated 73M9002 data
access arrangement, which provides the isolation circuitry
required by communications regulatory agencies.

Fig. 4. The Model 712 system board construction.

The telephony card is described in more detail in the article 
on page 69. 

Printed Circuit Board Design 

The Model 712 system contains a single printed circuit board 
called the system board. Fig. 3 shows a photograph of the 
system board. The system board supports all the functionality 
of the Model 712 system except for the optional boards and 
peripherals. 

The system board is 10 layers deep, and has 0.005-inch
traces and spaces. It measures 11.4 inches by 5.6 inches and
uses double-sided surface mount technology.

The board construction shown in Fig. 4 was designed with
the printed circuit board vendor to ensure that the least
costly materials were chosen to obtain the necessary
electrical parameters. Although it is designed to exhibit specific
trace impedances, the blank printed circuit board is not a
controlled-impedance design, which saves cost. The finished
board size is optimized to make the best use of standard
subpanel sizes used by the printed circuit board vendor.
Although the board does use 0.005-inch traces and spaces,
these minimum geometries are used only when necessary.
Whenever possible, less aggressive routing is used to help
with board yield and to keep down the cost of the board.

The design of the blank printed circuit board presented a
number of technical challenges and some cost-saving
opportunities.

Performance Challenges. The clock and cache layouts pre- 
sented some very special challenges in designing the printed 
circuit board. 

Fig. 5 shows a simplified block diagram of the clock circuit
used in the Model 712. All ECL circuitry is powered from the
VCC supply, and all clock receivers in the VLSI are designed
to operate at these shifted ECL voltage levels. This saves the
cost of additional supply voltages and level translators. The
master clock is first buffered, and multiple copies are routed
to the receiving VLSI. This way, the delay to each device can
be independently controlled to minimize clock skew and
maximize system performance. Clocks are all routed on
inner layers, where propagation delay is better controlled
because of the trace's stripline nature. The clocks are driven
as differential pairs and are routed next to each other to
minimize differential noise generation and susceptibility. The
clock circuitry also features an interesting termination
scheme. This pi-termination network is designed to
approximate the same load as other more traditional termination
schemes. However, it has the advantage of using zero supply
current and fewer parts.

Fig. 6 shows a conceptual representation of how the cache
is routed. The cache line is routed to minimize cache
address drive delay. This arrangement also cuts down on the
number of vias and maintains an unbroken ground plane.
Address lines are routed from the CPU to the first via split
on inner layers, where the impedance is close to half that of
the outer layers. This is to better match the impedance of
the traces on the two outer layers, which are essentially in
parallel.
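
The reasoning behind the halved impedance can be made concrete: where one trace splits into two branches, the driving trace sees the two branch impedances in parallel. The sketch below applies the ordinary parallel-combination formula. The 100-ohm outer-layer value is purely hypothetical, since the article gives no impedance figures.

```python
def parallel_impedance(*z_ohms):
    """Equivalent impedance of several impedances connected in parallel."""
    return 1.0 / sum(1.0 / z for z in z_ohms)


# Hypothetical value for illustration only; not taken from the article.
OUTER_LAYER_Z_OHMS = 100.0

# At the via split, the inner-layer trace drives two outer-layer traces.
load_z = parallel_impedance(OUTER_LAYER_Z_OHMS, OUTER_LAYER_Z_OHMS)
print("Two outer-layer branches present about", load_z, "ohms")
```

Two equal branches present half the impedance of one, so an inner-layer trace with roughly half the outer-layer impedance is approximately matched at the split.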

EMC and EMI Control. In addition to more traditional methods
of EMC and EMI control, the Model 712 system board uses
features built into the blank printed circuit board to mimic
the functionality of equivalent discrete designs. However,
since they are built into the printed circuit board, their
benefits are essentially free.

Small spark gaps are placed near many of the connectors to 
help control ESD. These spark gaps are simply very small 
trace segments separated at minimum geometries to provide 
a shunt path for ESD energy from signal to ground. 

To control RFI, the printed circuit board makes use of a
number of buried capacitors. Buried capacitors are
essentially small capacitors whose plates are all or part of the
printed circuit board's signal or ground layers. The dielectric
material of the printed circuit board serves to separate the
plates of the capacitors. Each power plane is effectively
bypassed to ground by placing a ground plane in close
proximity to it. Furthermore, some signals are also bypassed to
ground with small buried capacitors to shunt unwanted RFI
energy to ground.
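
The size of a buried capacitor can be estimated with the ideal parallel-plate formula, C = e0 * er * A / d. The sketch below is an order-of-magnitude illustration only. The relative permittivity, plate area, and layer spacing are hypothetical values, not figures from the article, and fringing effects are ignored.

```python
EPSILON_0 = 8.854e-12   # permittivity of free space, F/m
INCH = 0.0254           # metres per inch


def plate_capacitance_farads(area_m2, spacing_m, relative_permittivity):
    """Ideal parallel-plate capacitance, ignoring fringing fields."""
    return EPSILON_0 * relative_permittivity * area_m2 / spacing_m


# Hypothetical example: 1 square inch of overlapping copper, 0.005 inch
# of dielectric, and a relative permittivity of 4.5 (an assumed value for
# a typical glass-epoxy laminate).
c = plate_capacitance_farads(area_m2=INCH * INCH,
                             spacing_m=0.005 * INCH,
                             relative_permittivity=4.5)
print("about %.0f pF" % (c * 1e12))
```

Capacitance grows with plate area and shrinks with layer spacing, which is why placing a ground plane close to each power plane bypasses it effectively.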

Conclusion 

By taking the approach of designing from the ground up, the
Model 712 hardware designers have optimized each part of
the design to work together to provide outstanding
performance at very low cost. Designing the VLSI components as a
set balanced performance and cost and also simplified the
interfaces between the devices. By building in the features
wanted by most customers and making less common features
available only on low-cost option boards, the system cost is
minimized for most customers.

The Model 712 system performance is summarized in Table 
II. 





Table II
Model 712 Performance

Specification   712/60   712/80
SPECint92       58.1     84.3
SPECfp92        85.5     122.3
MFLOPS (DP)     12.8     30.6
AIM APR II      44.5     73.8

The PA 7100LC Microprocessor:
A Case Study of IC Design Decisions
in a Competitive Environment

Engineering design decisions made during the early stages of a product's 
development have a critical impact on the product's cost, time to market, 
reliability, performance, and success. 

by Mick Bass, Patrick Knebel, David W. Quint, and William L. Walker 



In today's competitive microprocessor market, successful
design teams realize that flawless execution of product
development and delivery is not enough to ensure that a
product will succeed. They understand that defining the correct
feature set for a product and creating design methodologies
appropriate to implement and verify that feature set are just
as important as meeting the product schedule.

The design decisions that engineers and managers make
while defining a new product have a critical impact on the
product's cost, time to market, reliability, performance, future
market demand, and ultimate success or failure. Engineers
and managers must make trade-offs based on these factors
to decide which features they should implement in a new
product and which they should not. Further, they must plan
their product development effort so that the methodologies
by which they develop their product are sufficient to ensure
that they are able to implement the product definition within
the required cost, schedule, and performance constraints.



Design choices arose frequently while we were defining and
implementing the PA 7100LC microprocessor.1 We were
targeting the PA 7100LC to be the processing engine of a new
line of low-cost, functionally rich workstation and server
products. Our design goals for the CPU were to provide the
system performance required for our target market at an
aggressively low system cost and to deliver the CPU on a
schedule that would not delay what was to become HP's
steepest computer system production ramp to date. Fig. 1
shows a simplified block diagram of the PA 7100LC
processor.

Meeting these goals sometimes required us to shift
our focus from the CPU to the impact of a particular feature
upon performance and cost at the system level.
Hewlett-Packard's position as a vendor of both microprocessors and
computer systems allowed us to use this technique with
much success.2,3 Even with this focus, however, the correct
decision could be far from obvious. We often identified
several alternative implementations of a particular feature,
each with its own impact on cost, schedule, and
performance. Trading these impacts against one another proved
very challenging. Design decisions also impacted each other,
with the outcome of one serving as a critical input to others.
The effect of a decision, for this reason, was sometimes
much larger than would have appeared at first glance.
Sometimes decisions created additional requirements, either for
new features or for new support methodologies. All of these
factors played together to underscore the fact that it was
critical to our product's success to have a decision process
that worked well.

We knew that a good definition of the PA 7100LC would
require that we make feature decisions in several areas,
including:

• Cache organization

• Number of execution units and superscalar design

• Pipeline organization

• Floating-point functionality

• Package technology

• Degree of integration

• Multimedia enhancements.

We also knew that we needed to select development
methodologies consistent with the feature decisions that we
made. Product features and required design methodologies
are often strongly connected. We couldn't consider the
benefits of one without the costs of the other, and vice versa.
Methodologies that were impacted by our feature-set
decisions included:

• Synthesis

• Place and route

• Behavioral simulation

• Presilicon functional verification

• Postsilicon functional and electrical verification

• Production test.

These methodologies are discussed in the article on page 23.

The cumulative effects of our decisions led to the creation of 
a low-cost, single-chip processor core that includes a built-in 
memory controller, a combined, variable-size off-chip primary
instruction and data cache, a 1K-byte on-chip instruction
buffer, and a superscalar execution unit with two integer
units and one floating-point unit. We reduced the size and
performance of the floating-point unit, which we had leveraged
from the PA 7100 processor. 4,5 We added Iupy, sample-on-the-fly,
and debug modes to enhance testability, reduce
test cost, and accelerate the postsilicon schedule. We tailored
the methodologies by which we created the chip to
match the features that we had decided upon.

This article provides examples of our decision-making pro- 
cess by exploring the decisions that we made for several of 
the features listed above. In each case, we present the alter- 
natives that we considered, the costs and benefits of each, 
and the impact on other features and methodologies. We
discuss our decision criteria. Since we strive to continually 
improve our ability to make good design decisions, we also 
present, wherever possible, a bit of hindsight about the process.
In most cases, we still believe that we selected the
correct alternative. However, if this is not the case, we discuss
what we have learned and the modifications we made
to our process to incorporate this new knowledge. 

The Design Decision Process 

Most design decisions ultimately come down to trade-offs
between cost, schedule, and performance. Unfortunately, it
is often difficult to determine the true cost, schedule, or 
performance for the wide variety of implementations that 
are possible. And since these three factors most often play 
against each other, it is necessary to make sacrifices in one 
or two of the areas to make gains in the others. 

The cost of a processor core is determined by the cost of
silicon die, packaging, wafer testing, and external SRAM and
DRAM. Breaking this down, we find that the cost of a die is
determined by the initial wafer cost and the defect density of
the IC process being used. Wafers are more expensive for
more advanced processes because of higher equipment,
development, and processing costs. The die packaging costs
are determined primarily by package type and pin count.
Large-pinout packages can be very expensive. An often ignored
cost is the tester time required to determine that a
manufactured part is functional. Reducing the time needed
for wafer and package testing directly reduces costs. Finally,
SRAM and DRAM costs are determined by the number, size,
and speed of the parts needed to complete the design.
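
The die-cost relationship described above can be sketched numerically. The following is a minimal illustration, assuming a simple Poisson yield model (yield = exp(-area x defect density)), which the article does not specify. The wafer cost, wafer diameter, and defect densities are hypothetical placeholders; only the 1.4 cm by 1.4 cm die size comes from the article.

```python
import math

def cost_per_good_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2):
    """Rough cost of one good die under an assumed Poisson yield model."""
    wafer_area = math.pi * (wafer_diameter_cm / 2.0) ** 2
    gross_dies = int(wafer_area // die_area_cm2)  # ignores edge losses
    die_yield = math.exp(-die_area_cm2 * defects_per_cm2)
    return wafer_cost / (gross_dies * die_yield)

DIE_AREA = 1.4 * 1.4  # cm^2, die size quoted in the article

# Hypothetical inputs: a mature process (cheaper wafers, fewer defects)
# versus a more advanced one (costlier wafers, more defects).
mature = cost_per_good_die(900.0, 15.0, DIE_AREA, 0.5)
advanced = cost_per_good_die(1500.0, 15.0, DIE_AREA, 1.5)
print(f"mature process:   {mature:.2f} per good die")
print(f"advanced process: {advanced:.2f} per good die")
```

Under these assumptions both a higher wafer cost and a higher defect density push the cost of a good die up, which is the effect the paragraph describes.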

The schedule of a project is determined by the complexity of
the design and the ability to leverage previous work. Each
design feature requires certain time investments and has 
associated risks. Time is required for preliminary feasibility 
investigations, design of control algorithms, implementation 
of circuits, and presilicon and postsilicon verification. 
Schedule risks include underestimation of time requirements 
because of unexpected complexity and the extra chip turns 
required to fix postsilicon bugs associated with complex 
design features. 

Performance is conceptually simple, but because of the in- 
tricacy of processor design it is often difficult to measure 
without actual prototypes. HP has invested heavily in perfor- 
mance simulation and analysis of its designs. Results from 
HP's system performance lab were invaluable in making 
many of the design decisions for the PA 7100LC. By support- 
ing a detailed simulation model of each processor developed 
by HP, the system performance lab is able to provide quick 
feedback about proposed changes. HP also uses these models
after silicon is received to help software developers
(especially for compilers and operating systems) determine
bottlenecks that limit their performance. 

Engineers at the system performance lab design their processor
simulators in an object-oriented language to allow easy
leverage between implementations. All processor features
that affect performance are modeled accurately by close 
teamwork between the performance modeling groups and 
the hardware design groups. As the hardware group consid- 
ers a change to a design, the change is made in the simula- 
tor, and simulations are done to allow simple comparisons 
that differ by only a single factor. This is continued in an
iterative fashion until all design decisions have been made,
after which we are left with a simulator that matches the
hardware to be built.

Without performance simulations, it would be very difficult
to estimate performance for a proposed implementation. 
Even something as simple as a change in operating frequency 
has effects that are difficult to estimate because of the inter- 
actions between fixed memory access times and processor 
features. As processor frequency increases, memory latencies
measured in processor cycles increase, but this increased
latency is sometimes (but not always) hidden by features such
as stall-on-use. Stall-on-use allows the processor to continue
execution in the presence of cache misses as long as the data
is not needed for an operation. These interactions make
accurate hand calculations impossible, creating a need to use
simulations for comparing many different implementation options.
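
The frequency effect can be illustrated with a small calculation. This is only a sketch: the 100-ns memory access time and the count of independent instructions are hypothetical values chosen for illustration, and the stall-on-use model is deliberately simplified to "cycles of latency not covered by independent work."

```python
import math

def latency_in_cycles(access_time_ns, freq_mhz):
    """A fixed access time costs more processor cycles as frequency rises."""
    cycle_ns = 1000.0 / freq_mhz
    return math.ceil(access_time_ns / cycle_ns)

def exposed_stall(latency_cycles, independent_cycles):
    """Simplified stall-on-use: only latency not hidden by independent work stalls."""
    return max(0, latency_cycles - independent_cycles)

ACCESS_NS = 100.0   # hypothetical fixed memory access time
INDEPENDENT = 6     # hypothetical cycles of work that do not need the missing data

for mhz in (50, 100):
    cycles = latency_in_cycles(ACCESS_NS, mhz)
    print(mhz, "MHz:", cycles, "cycles of latency,",
          exposed_stall(cycles, INDEPENDENT), "exposed")
```

With these inputs the miss is fully hidden at the lower frequency but partly exposed at the higher one, which is why the net effect of a frequency change is hard to estimate by hand.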

The performance simulations are based on SPEC and TPC
benchmarks. While these benchmarks are useful for gathering
performance numbers and making comparisons, they do
not tell the whole story. Many applications are not represented
by the benchmarks, including graphics, multimedia,
critical hand-coded operating system routines, and so on.
When evaluating features related to these applications, we
work directly with people in those areas to analyze the impact
of any decisions. Often this involves analyzing by hand
critical sections of the code (e.g., tight loops) to evaluate
the overall performance gain associated with a feature. For
the PA 7100LC, this was especially true for the multimedia
features.

The ability to quantify the impact of proposed features on 
cost, schedule, and performance was paramount to our 
ability to make sound design decisions. 

Integration 

The first design decisions that we made were related to the 
high-level question "How highly integrated should we make 
the chip?" This led to the questions: Should we include an 
on-chip cache or not? If so, how large should it be? If we
have an off-chip cache, how should we structure it? How
should the CPU connect to memory and I/O? Should the
memory controller be integrated or not? 



The primary question was whether the CPU, cache, and 
memory system should live on a single die in a single pack- 
age, or whether we should partition this functionality onto 
two or more chips. 

The trade-offs involved in this decision were numerous. Die
cost would increase for a multichip solution. Package cost
would vary with the partitioning that we chose, as would
package type and maximum pin count. Required signal-to-ground
ratios would vary with package type, which would
either limit the signal count or require more pins (at a higher
cost). Performance, design complexity, and schedule risk
would be greatly impacted by the partitioning decision.

To sort out these trade-offs, we started with a packaging 
investigation that quantified cost, performance, and risk for 
different packaging alternatives. This investigation yielded a 
preferred package: a 432-pin ceramic pin grid array (see Fig.
2a). This package, with its large signal count, could accommodate
the extra interfaces required to include a memory
controller, an I/O controller, and an external cache controller.

The memory controller and cache investigations were
tightly coupled. Performance simulations always included
features from both subsystems because small changes in the
behavior of one subsystem could drastically affect the performance
of the other. In the end we realized that the performance
gains brought by an integrated memory controller
enabled smaller, cheaper caches without sacrificing overall
performance. This realization drove the development of the
cache subsystem.

Package Selection and CPU Partitioning. We targeted the IC
package design with the objective of minimizing system cost
with little compromise in performance. The customary package
for CPU chips is either a quad flat pack (QFP) or a pin
grid array (PGA). The QFP is a plastic, low-profile package with
gull-wing connections on four sides. The QFP is inexpensive
and easy to mount on a printed circuit board and has gained
acceptance rapidly for surface mounting to printed circuit
boards. It has the disadvantage that the number of pins is
limited. Pin counts above 200 are fragile and difficult to keep
coplanar for surface mounting. The package also has very
limited ability to dissipate heat because the chip is encased
in plastic. A recent improvement to this package sandwiches
the chip between two pieces of aluminum, which can dissipate
up to four watts of heat (ten watts with a heat sink). It
was this metal quad, or MQUAD, that became a candidate
for a low-cost package for our high-performance CPU. HP's
package of choice for previous CPUs has been the ceramic
pin grid array, a complex brick of aluminum oxide and tungsten
built in layers and fired at 2000°F. The PGA used for
the PA 7100 processor (the basis for the PA 7100LC) was a
504-pin design that incorporated the following advanced
features:

A tungsten-copper heat-conducting slug for superior thermal 
conductivity to the heat sink 

Ceramic chip capacitors mounted on the package for power 
bypassing 

Thin dielectric layers that minimized power supply
inductance

Use of 0.004-inch vias internally (most are 0.008-inch).

This package performed its thermal and electrical duties 
very well, but its cost had always been an issue. 

Our strategy to develop a low-cost CPU coupled chip partitioning
options with the packaging options of using either
two low-cost MQUAD packages or placing a single large chip
in a PGA. The two-chip CPU could be placed in one 240-pin
and one 304-pin MQUAD (see Figs. 2b and 2c). The other
alternative was to place a larger integrated chip in a single
432-pin PGA (see Fig. 2a). The first cost estimate assumed
that the PGA would be priced similarly to the 504-pin package.
The total cost of both MQUAD chips was initially
thought to be about 75% less than the PGA estimate. This
would seem to indicate that the MQUAD would be the definite
candidate to meet our low-cost goals. However, that
perception changed as our investigation continued.

We didn't expect the MQUAD's electrical performance to
match that of the PGA because the MQUAD we were considering
had only one layer of signals and no ground planes.
Ground planes can be used to shield signal traces from each
other and reduce inductances of signals and power supplies.
The PGA could incorporate several ground planes if necessary.
On the other hand, the MQUAD package can only approach
the shielding effect of the ground planes by making
every other lead a ground, which severely limits the number
of usable signals. Gaining a lower package price by using the
MQUAD would require redesigning the I/O drivers specifically
to reduce rise times and thereby control crosstalk and
power supply noise.

The PA 7100 PGA's electrical performance exceeded the
needs of this chip, so the strategy shifted to trading away
excess performance to gain lower cost. The number of
power and ground planes was reduced to two. The design
was also modified to optimize performance without using
package-mounted bypasses or thin dielectric layers. The
PGA design was reduced to four internal metal layers with
no bypassing, no thin dielectric layers, and no 0.004-inch vias,
all of which reduced cost compared to the 504-pin PGA mentioned
above.

The power dissipation of the chips would also have been an
issue for the MQUADs. Heat sinking to further improve the
thermal resistance of the packages might have been required.
CPU designs are often upgraded to higher clock
speeds after first release, so if package heat dissipation is
marginal, upgrade capability is jeopardized. (Typically,
power dissipation is proportional to operating frequency.)
The 504-pin PGA had already been used to dissipate 25
watts, which left an opportunity for cost-saving modifications.
With the thermal margin in mind, two design changes
were investigated, one to use a lower-cost copper-Kovar-copper
laminate heat spreader, and the other to eliminate
the heat spreader entirely. The first option was dismissed
because of failures found during a low-temperature storage
test. (The laminate heat spreader detached from the ceramic
body because of a disparity in thermal expansion rates.) The
second option was also dismissed when the thermal resistance
of the ceramic carrier was found to be too high.
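
The upgrade concern follows from the rule of thumb that power scales with frequency. The sketch below applies that proportionality to the dissipation figures quoted in the article (4 W for an MQUAD, 10 W with a heat sink, 25 W already demonstrated in the 504-pin PGA). The baseline power and frequency are hypothetical, since the article does not give the chip's actual dissipation.

```python
def scaled_power(power_w, freq_mhz, new_freq_mhz):
    """Power at a new clock rate, assuming power is proportional to frequency."""
    return power_w * new_freq_mhz / freq_mhz

def max_frequency(power_w, freq_mhz, package_limit_w):
    """Highest clock rate a package can support under the same assumption."""
    return freq_mhz * package_limit_w / power_w

BASE_POWER_W = 8.0    # hypothetical dissipation at the baseline clock
BASE_FREQ_MHZ = 60.0  # hypothetical baseline clock

packages = (("MQUAD", 4.0), ("MQUAD with heat sink", 10.0), ("504-pin PGA", 25.0))
for name, limit_w in packages:
    ceiling = max_frequency(BASE_POWER_W, BASE_FREQ_MHZ, limit_w)
    print(f"{name}: up to about {ceiling:.0f} MHz")
```

A package that is already near its limit at the first-release frequency leaves little room for later speed upgrades, which is the margin argument made above.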

The time schedule for the completion of reliability testing
and manufacturing feasibility studies had to be considered
when evaluating the two technologies. The PGA was a mature
technology with considerable experience behind it, and
the time schedule and results of the testing could be determined
with some certainty. The MQUAD was a new technology
by contrast. The design was solid, but had several new
features that were untested in terms of long-term reliability.
Despite the strong desire to exploit new technology, the 
schedule risk was a significant factor. 

By the time the partitioning decision was to be made, the
PGA cost had shrunk to almost half of its original cost, the
304-pin MQUAD was presenting schedule risks, and both
MQUADs had marginal power dissipation. Possibly most
important, the PGA provided a robust solution with thermal
and electrical margins. The cost difference was still significant,
but the PGA provided a flexibility to the chip designers
that offset its disadvantages. Thus, the PGA package was
chosen for the PA 7100LC.

Memory Controller Destiny. Whether or not to integrate the
memory and I/O controllers onto the CPU die was one of the
most direction-forming decisions that we made. To decide
correctly, we had to consider the effects of integration on
factors such as multiprocessor capability, system complexity,
memory and I/O controller design complexity, die cost, memory
system performance, and memory system flexibility.

Traditional multiprocessor systems have a single main mem- 
ory controller and I/O controller (see Figs. 3a and 3b). These 
controllers maintain connections to the multiple processors. 
Systems organized in this way separate the memory and I/O
controllers from the CPU. This organization allows users to 
upgrade entry-level systems to include multiple processors 
at the expense of reducing the memory and I/O performance 
of uniprocessor systems and adding significant complexity 
to both the memory and cache controllers. 

Our design goals focused on maximizing uniprocessor performance.
HP was already shipping desktop multiprocessor
systems built around the PA 7100 microprocessor at the
time we were making these decisions. The market segment
that we were targeting for the PA 7100LC demanded peak
uniprocessor performance at a low system cost. Since our
target market didn't require multiprocessing as a system
option, we directed our efforts toward the benefits that we
could bring to a system through a focused uniprocessor design.

Fig. 3. (a), (b) Multiprocessor architectures in which the memory
and I/O controller are separate from the CPUs. (c) A uniprocessor
system in which the memory and I/O controller are integrated into
the CPU chip.

Integrating the memory and I/O controller with the CPU in a
uniprocessor system (Fig. 3c) can have a dramatic effect on
reducing cache miss penalties by decreasing the number of
chip boundaries that the missing data must cross and by
allowing the memory and I/O controller early access to important
CPU internal signals. Miss processing on the memory
interface can effectively begin in parallel with miss detection
in the cache controller. An integrated memory controller can
even use techniques such as speculative address issue to
begin processing cache misses before the cache controller
detects a cache miss.

The reductions in CPI (cycles per instruction) that we could 
achieve by integrating the memory controller allowed us the 
degrees of freedom that we needed to explore certain cache 
architectures in greater detail. Some of these architectures 
are described in the next section. 
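
The link between miss penalty and cache size can be made concrete with the usual first-order CPI model, CPI = base CPI + misses per instruction x miss penalty. The sketch below uses that textbook model, not the HP simulators, and every number in it is hypothetical.

```python
def effective_cpi(base_cpi, misses_per_instruction, miss_penalty_cycles):
    """First-order CPI model: base CPI plus average miss stall per instruction."""
    return base_cpi + misses_per_instruction * miss_penalty_cycles

def break_even_miss_rate(miss_rate, old_penalty, new_penalty):
    """Miss rate that yields the same CPI once the miss penalty has dropped."""
    return miss_rate * old_penalty / new_penalty

# Hypothetical values: a separate controller with a 20-cycle miss penalty
# versus an integrated controller with a 15-cycle penalty.
before = effective_cpi(1.0, 0.02, 20)
allowed = break_even_miss_rate(0.02, 20, 15)
after = effective_cpi(1.0, allowed, 15)
print(f"CPI before: {before:.2f}, tolerable miss rate after: {allowed:.4f}, CPI after: {after:.2f}")
```

In this model a shorter miss penalty lets a smaller cache with a higher miss rate deliver the same CPI, which is the freedom the paragraph above refers to.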



System complexity is reduced with an integrated memory and
I/O controller. The 432-pin PGA that we were considering
for an integrated design had sufficient signal headroom to
enable separate, dedicated memory and I/O connections. A
two-chip approach, using the lower-cost MQUAD packages,
would be forced to share pins between the memory and I/O
connections to accommodate the low signal count of the
MQUAD package, which would increase system complexity.

An integrated memory and I/O controller also simplifies the
interface to the CPU. Since this interface connects two entities
on the same die, signal count on the interface became
much less important, which allowed us to simplify the
interface design considerably.

On the down side, integrating a memory and I/O controller 
required enough flexibility in its design to satisfy the broad 
range of system customers that our chip would encounter. 
However, this requirement also exists for a nonintegrated
solution. Historically, system partners have not redesigned
memory controllers that the CPU team has provided as part
of a CPU chipset. HP's advantage of providing both processors
and systems has allowed us to work closely with system
designers and enabled us to meet their needs in both
integrated and nonintegrated chipsets.

In summary, integrating the memory and I/O controller onto 
the CPU core introduced a gain in performance, a reduction 
in complexity and schedule risk, and several possibilities for 
reduced cost in the cache subsystem. These were the compelling
reasons to move the memory controller onto the
CPU die and continue exploring cache alternatives and optimizing
memory system performance.

Cache Organization. One of the distinguishing characteristics
of HP PA-RISC designs over the past several implementations
has been the absence of on-chip caches in favor of
large, external caches. While competitors have dedicated
large portions of their silicon die to on-chip RAMs, HP has
continued to invest in aggressive circuit design techniques
and higher pin count packages that allow their processors to
use industry-standard SRAMs, while fetching instructions
and data every cycle at processor frequencies of 100 MHz
and above. This has allowed our system partners to take a
single processor chip and design products meeting a wide
range of price and performance points for markets ranging
from the low-cost desktop machines to high-performance
servers. For example, the PA 7100 chip has been used in
systems with cache sizes ranging from 128K bytes to 2M
bytes and processor frequencies ranging from 33 MHz to 100
MHz.

The main design goals for the PA 7100LC were low cost and 
high performance. Unfortunately, high-performance systems 
use large, fast, expensive caches. Obviously, trade-offs had 
to be made. As with previous implementations, the designers
started with a clean slate and considered various cache
options, including on-chip cache only, on-chip cache with an
optional second-level cache, split instruction and data off-chip
caches, and combined off-chip caches (see Fig. 4). Ultimately,
the cache design was closely linked to the memory
controller design because of the large effect of memory 
latency on cache miss penalties. 

Fig. 4. Different cache organizations. (a) On-chip cache. (b) On-chip
cache with an optional second-level cache. (c) Split instruction
and data off-chip caches. (d) Combined off-chip caches.

On-chip caches have the obvious advantage that they can
allow single-cycle loads and stores at higher chip frequencies
than are possible with many off-chip cache designs. They
also allow designers to build split and associative cache
arrays which would be prohibitive for off-chip designs because
of the large number of I/O pins required. Unfortunately,
in current technologies on-chip caches tend to be
fairly small (8K bytes to 32K bytes) and even with two-to-four-way
associativity, they have higher miss rates than
larger (64K bytes to 256K bytes) direct-mapped, off-chip
caches. Also, on-chip caches require a substantial amount of
chip area, which translates to higher costs, especially for
chips using leading-edge technology with high defect densities.
This extra chip area also represents lost opportunity
cost for other features that could be included in that area.
Examples include an on-chip memory and I/O controller,
graphics controller, more integer execution units, multimedia
special function units, higher-performance floating-point
circuits, and so on.

Another drawback of on-chip caches is their lack of scalability;
providing multiple cache sizes requires fabricating multiple
parts. To overcome this limitation designers can allow
for optional off-chip caches. The off-chip caches can range
in size and speed and can provide flexibility for system designers
looking to meet different price/performance choices.
Low-end systems need not include the off-chip cache and
can be built for a lower cost. High-end systems can get a
performance boost by paying the extra cost to add a secondary
off-chip cache. For most systems, the cost for this flexibility
is added pin count to allow for communication with
the off-chip caches. Other systems might be able to multiplex
the cache lines onto some already existing buses such
as the memory bus.

For the PA 7100LC, we determined that a primary on-chip
cache would cost too much in terms of more expensive
technologies, increased die size, and the lost opportunity of
putting more functionality on the chip. Without a primary
on-chip cache, we were able to design a processor with two
integer units, a full floating-point unit including a divide and
square root unit, and a memory and I/O controller. We
achieved this functionality using only 905,000 FETs in 0.8
micrometer (CMOS26) technology on a die measuring 1.4
cm by 1.4 cm (see Fig. 5). CMOS26 is a mature HP process
that has been used for several processor generations. As a
result, it has a low defect density and thus, a low cost. A
processor with an on-chip cache would have required a
more advanced technology having higher wafer costs and
defect densities. Of course, without an on-chip cache, we
were challenged to design a low-cost off-chip cache that
allowed accesses at the processor frequency.

HP's previous implementations of PA-RISC were built with
independent instruction and data caches made up of industry-standard
SRAMs (see Fig. 4c). It would have been easy to
leverage the independent direct-mapped instruction and
data cache design from the PA 7100, but we were determined
to find a less expensive solution. Independent cache
banks require a high pin count on the processor chip because
each bank requires 64 data pins and about 24 pins for
tag, flags, and parity. Thus, combining instructions and data
into a single set of cache RAMs (Fig. 4d) saves about 88 pins
on the processor chip. These extra pins directly affect packaging
costs. Also, providing split caches requires using more
SRAM parts in a given technology. Systems based on the PA
7100LC with a combined cache require only 12 SRAM parts
using ×8* technology. By leveraging the aggressive I/O design
from previous implementations, the PA 7100LC can
access 12-ns SRAM parts every cycle when operating at frequencies
up to 66 MHz. Since 8K × 8, 12-ns SRAMs are commodities
in today's market, the cost of a 64K-byte cache subsystem
for a 60-MHz PA 7100LC is comparable to the price
we would have paid for a much smaller on-chip cache.
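
The pin and timing arithmetic in this paragraph can be checked with a few lines. This is a sketch under stated assumptions: the per-bank pin counts are the article's approximate figures, the operating frequency is taken as 66 MHz, and the timing check is a simplification that compares the SRAM access time with the raw cycle time and ignores I/O driver, board, and setup delays.

```python
DATA_PINS_PER_BANK = 64
TAG_FLAG_PARITY_PINS_PER_BANK = 24  # "about 24" in the text

def pins_for_banks(banks):
    """Processor pins needed to drive the given number of cache banks."""
    return banks * (DATA_PINS_PER_BANK + TAG_FLAG_PARITY_PINS_PER_BANK)

def cycle_time_ns(freq_mhz):
    return 1000.0 / freq_mhz

def fits_in_cycle(sram_access_ns, freq_mhz):
    """Simplified check: the SRAM access must complete within one clock period."""
    return sram_access_ns <= cycle_time_ns(freq_mhz)

pins_saved = pins_for_banks(2) - pins_for_banks(1)
print("pins saved by combining the caches:", pins_saved)
print("cycle time at 66 MHz (ns):", round(cycle_time_ns(66.0), 2))
print("12-ns SRAM fits at 66 MHz:", fits_in_cycle(12.0, 66.0))
```

The remaining few nanoseconds of each cycle are what the aggressive I/O design has to fit within.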

Combined instruction and data caches have one large draw- 
back. Since the PA 7100LC processor can consume instruc- 
tions as fast as the cache can deliver them, there is little or 
no cache bandwidth left to satisfy load and store operations. 
To solve this problem, we needed to implement some type of 
instruction buffer on the processor chip. A large instruction 
buffer would have all the drawbacks of the on-chip cache 
design discussed above, so we were determined to find a 
way to achieve the desired performance with a small buffer. 
We knew we would need a mechanism to prefetch instructions
from the off-chip combined cache into the dedicated
on-chip buffer during idle cache cycles. Thus, we started
with a standard direct-mapped 2K-byte buffer and simulated
various prefetch and miss algorithms. As expected, we
found that performance was extremely sensitive to the
buffer miss penalty, which ranges from zero to two states
for a two-word cache line, depending on branches and prefetches.
We designed the prefetch machine to use virtually
every idle cache cycle and tried to get early access to the
off-chip cache on branches. Some branches can be treated
like loads or stores and be given access to the off-chip cache
even before they have access to the on-chip cache. Using a
small buffer with a good prefetch algorithm we were able to
greatly mitigate the penalty associated with having a single
bus to the off-chip cache. Of course, if cost had not been
such an important factor, we still would have implemented
split off-chip caches to get the extra performance and to
reduce complexity.

* RAM sizes are quoted in depth by width (i.e., 64K × 8 is 65,536 deep by 8 bits wide).

After settling on prefetch and miss algorithms, we simulated 
various buffer sizes, associativity, and line sizes to determine 
the optimal configurations. We found that associativity increased
performance by less than 1% while increasing area
and complexity. We also found that if we decreased the
buffer from 2K bytes to 1K bytes and used the resulting area
savings to increase the TLB (translation lookaside buffer)
size from 48 to 64 entries, we could gain about 1% performance
improvement without any added complexity or area.
Thus, we chose a 1K-byte direct-mapped instruction buffer
and a 64-entry TLB. We also simulated other buffer options
requiring less chip area, including a split buffer design with
a 128-byte branch target buffer and a 256-byte prefetch
buffer. We had hoped that the prefetch buffer could keep up
with the instruction demand for sequential code while the
branch target buffer supplied instruction targets for
branches. Unfortunately, such a small branch target buffer
could not hold most of the recently taken branch targets and
the performance was 2 to 5 percent less than the 1K-byte
buffer.

Fig. 5. A photomicrograph of the PA 7100LC CPU. The die measures
1.4 cm by 1.4 cm and contains 905,000 FETs in 0.8-micrometer
(HP CMOS26) technology.


Given the cost constraints, designing the off-chip cache was
still fairly straightforward. As explained above, we wanted to 
have a single cache structure to hold both instructions and 
data. We had the choice of designing a unified cache that 
allows instructions and data to reside anywhere in the cache, 
or a logically split cache that divides the structure into dis- 
tinct halves by using an address bit to distinguish between 
instructions and data. Unified caches have the advantage 
that they dynamically allocate more or less of the cache to 
instructions or data as appropriate for the application. This 
feature gives them slightly less than a 1% advantage for most 
benchmarks. Unfortunately, applications have a greater probability
of thrashing a unified cache by accessing the same
cache index for both instructions and data. Because of the
potential thrashing problem and to make control algorithms 
easier, we chose to implement the logically split cache. 
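
The difference between the two organizations comes down to how the cache index is formed. The sketch below is illustrative only: it assumes a direct-mapped cache with 32-byte lines (four double words) and a hypothetical 64K-byte capacity, and the `is_instruction` flag stands in for the address bit that the logically split design uses to select a half. The function names are not taken from the actual design.

```python
LINE_BYTES = 32           # four 8-byte double words per line (assumed)
CACHE_BYTES = 64 * 1024   # hypothetical cache capacity
NUM_LINES = CACHE_BYTES // LINE_BYTES

def unified_index(address):
    """Unified cache: instructions and data share every index."""
    return (address // LINE_BYTES) % NUM_LINES

def split_index(address, is_instruction):
    """Logically split cache: one index bit selects the instruction or data half."""
    half = NUM_LINES // 2
    base = half if is_instruction else 0
    return base + (address // LINE_BYTES) % half

code_addr = 0x4000
data_addr = 0x14000  # exactly one cache size above code_addr

print("unified:", unified_index(code_addr), unified_index(data_addr))
print("split:  ", split_index(code_addr, True), split_index(data_addr, False))
```

In the unified case the two addresses map to the same line and can evict each other repeatedly; in the split case an instruction can never displace data, at the price of a fixed half-and-half allocation.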

Besides the combined cache structure, another interesting 
difference between the PA 7100 and PA 7100LC cache de- 
signs is the different DRAM configurations. The PA 7100LC 
is designed to access one double word (eight bytes) per 
DRAM access to allow smaller systems to be built with only 
nine eight-bit-wide parts, whereas most PA 7100 systems 
access two double words per DRAM access and require at 
least eighteen eight-bit-wide parts. The implication of this is 
that PA 7100-based systems can buffer the DRAM data and 
return a double word every two cycles, which matches the 
two-cycle write time required to copyin to the cache. PA 
7100LC-based systems, on the other hand, are DRAM-limited 
and can return a double word only every three cycles. Thus, 
on a line copyin from memory, a PA 7100 will lock the cache 
for eight cycles (4 double words x 2 cycles/double word). 
Had the PA 7100LC leveraged the PA 7100's control algorithms,
it would have locked the cache for 12 cycles (4
double words x 3 cycles/double word) for every cache miss.
We found that by changing the control algorithms and opening
one-cycle windows to the cache during copyins, we could
allow loads, stores, misses, or prefetches to occur and we
gained over 10% in overall performance on most benchmarks.
This large increase indicates how seemingly small
changes between processors can have dramatic effects. 
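
The copyin arithmetic can be written out as follows. The lock-cycle products come straight from the text. The count of free windows is an inference, assuming each double word still needs the two-cycle cache write mentioned above and that the remaining cycle of each three-cycle DRAM beat is what gets opened to other accesses.

```python
def copyin_lock_cycles(double_words_per_line, cycles_per_double_word):
    """Cycles the cache is locked if it stays busy for the whole line copyin."""
    return double_words_per_line * cycles_per_double_word

def idle_windows(double_words_per_line, dram_cycles_per_double_word, write_cycles=2):
    """Cycles per line in which the cache is not being written (assumed model)."""
    return double_words_per_line * (dram_cycles_per_double_word - write_cycles)

print("PA 7100 lock cycles:", copyin_lock_cycles(4, 2))
print("PA 7100LC lock cycles with the old algorithm:", copyin_lock_cycles(4, 3))
print("one-cycle windows per line on the PA 7100LC:", idle_windows(4, 3))
```

Under this model the PA 7100 has no spare cycles during a copyin, while the PA 7100LC has one per double word that can serve a load, store, miss, or prefetch.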

Performance 

Our decision to integrate the memory controller onto the
CPU die required that we carefully consider other performance
features in that light. Features generally take up
silicon area, and at our target 14 × 14 mm², chip area was at
a premium. We needed to ensure that our design would continue
to fit into our target die size. We also needed to minimize
the active area on the die to reduce the cost of the processor.

These considerations led us to search for simple means to 
free area on the die that had little impact on performance. 

Floating-Point Unit. The floating-point performance of the PA
7100 was so strong that we had the option of trading some
of it away to reduce cost for the PA 7100LC. About 25 mm²
of the PA 7100 die area was devoted to the floating-point
data path. Performance simulations indicated that if we
copied it unchanged into the PA 7100LC it would achieve a
performance of at least 130 for SPECfp92 at 80 MHz.



We considered several schemes for scaling back the
floating-point unit. One idea was to delete the divide and
square-root block. Divides and square roots would be
implemented in hardware by iterating through the multiplier
with a Newton-Raphson⁶ or Goldschmidt algorithm. The
performance loss for this change would be negligible, and we
would save 1.5 mm². However, it would have introduced a
significant amount of new complexity to the multiplier, and
the area saved was not that great considering the excellent
compactness of the existing divide and square-root block. We
decided that the area saved was not worth the schedule risk,
so we kept the divider. Complexity is very difficult to
quantify, but as a project moves through its development
cycles, an earlier decision to simplify something is almost
always remembered with a feeling of great relief. This
decision was no exception.
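
To make the rejected alternative concrete, the following is a minimal software sketch of Newton-Raphson division, in which the quotient is obtained by repeated multiplication instead of a dedicated divider. It is not the hardware scheme the team evaluated; the function name, the linear seed, and the iteration count are assumptions chosen for illustration.

```python
import math


def newton_raphson_divide(a: float, b: float, iterations: int = 5) -> float:
    """Approximate a / b using multiplies and subtracts only (illustrative)."""
    if b == 0.0:
        raise ZeroDivisionError("division by zero")
    sign = -1.0 if b < 0 else 1.0
    # Split |b| into mantissa * 2**exponent with mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(abs(b))
    # Linear first guess for 1 / mantissa (an assumed seed, not from the article).
    x = 48.0 / 17.0 - (32.0 / 17.0) * mantissa
    for _ in range(iterations):
        # Each step uses two multiplies and one subtract, and roughly
        # doubles the number of correct bits in the reciprocal estimate.
        x = x * (2.0 - mantissa * x)
    # Undo the scaling: 1 / b = (1 / mantissa) * 2**(-exponent).
    return sign * a * math.ldexp(x, -exponent)


if __name__ == "__main__":
    print(newton_raphson_divide(1.0, 3.0))
```

Every iteration occupies the multiplier, which hints at the extra sequencing control such a scheme would have added to the multiplier design.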

Another proposal for reducing area was to fold the
multiplier array. Multiplication on the PA 7100 is performed
in four phases (two clock cycles). The partial products are
summed during the middle two phases by a tree of dynamic
carry-save adders (see Fig. 6a). If we used a smaller tree of
carry-save adders, single-precision multiplies, with their
24-bit mantissas, could still be evaluated in one pass, but
double-precision multiplies, with their 53-bit mantissas,
would go through the tree twice (Fig. 6b). We found that
folding the multiplier array for the PA 7100LC would save
about 3 mm², but would increase the overall double-precision
multiply latency from two cycles to three cycles.
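
The idea of folding can be sketched in software as follows. This is only a behavioral model under stated assumptions: the article does not give the size of the reduced adder tree, so the value of 27 partial products per pass is hypothetical, and a real array would use carry-save adders (and possibly operand recoding) rather than a running sum.

```python
def folded_multiply(a: int, b: int, width: int, rows_per_pass: int = 27):
    """Multiply a by a width-bit b, summing at most rows_per_pass
    partial products per pass through the (modeled) adder tree.

    Returns (product, number_of_passes). rows_per_pass = 27 is an
    assumed value, not a figure from the article.
    """
    product = 0
    passes = 0
    for start in range(0, width, rows_per_pass):
        # Select the multiplier bits handled in this pass.
        chunk = (b >> start) & ((1 << rows_per_pass) - 1)
        for bit in range(rows_per_pass):
            if (chunk >> bit) & 1:
                # One partial product: a shifted to the weight of this bit.
                product += a << (start + bit)
        passes += 1
    return product, passes


if __name__ == "__main__":
    single = folded_multiply((1 << 24) - 1, 0xABCDEF, width=24)
    double = folded_multiply((1 << 53) - 1, (1 << 52) | 12345, width=53)
    print(single[1], double[1])  # passes needed for 24-bit and 53-bit mantissas
```

With these assumptions a 24-bit mantissa needs one pass and a 53-bit mantissa needs two, which mirrors the extra cycle of double-precision latency described above.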

Low-level graphics software can be sensitive to
floating-point latencies, so we consulted our partners in the
graphics software lab. They determined that folding the
multiplier array would be acceptable for the HP 9000 Model
712 workstation because the relevant software used mostly
single-precision math. We simulated the effect of the higher
latency on some of the SPECfp benchmarks. The geometric
mean of the benchmarks lost less than 7% performance, but
the losses for individual benchmarks varied widely, from 1%
to 13%. We had some concern about the variance because
large customers frequently use their own benchmarks, some
of which are bound to be sensitive to double-precision
multiply performance. But even 13% was judged to be an
acceptable trade-off for the area involved, so we decided to
fold the multiplier.

By the end of the project we found that 3 mm² was not
nearly as valuable as we first thought it would be. However,
the area saved by folding the multiplier was removed from a
critical chip dimension shared with the new memory and I/O
controller, so the decision to fold the multiplier was solid.

We simplified the floating-point controller by stalling the
pipeline unconditionally during the execution of a divide,
square root, or double-precision multiply. On the PA 7100 a
divide or square root conditionally stalls the pipeline until
a subsequent instruction tries to use its result. However,
this conditional divide stall was a source of bugs late in
the design cycle of that chip, so this simplification
positively affected the PA 7100LC schedule.

The performance loss for this change was estimated at 1% for
divide and square root and 2% for double multiply. The area
savings was small, but the savings in complexity persuaded
us to make this change.

Fig. 6. (a) The floating-point multiplier used in the PA 7100
processor. (b) The floating-point multiplier used in the
PA 7100LC. Sticky computes the sticky bit, which is part of
the IEEE floating-point standard.

The performance loss because of the floating-point changes
turned out to be as small as we expected. Floating-point
performance is often dominated by cache size and memory
latency. PA 7100LC-based systems typically have a smaller
cache but faster memory than PA 7100-based systems. The
final product achieved over 120 for SPECfp92 at 80 MHz,
which was significantly higher than the competition and
compares surprisingly well with the larger and faster PA 7100
floating-point performance.

Dual Issue. PA 7100LC-based systems needed to perform as
well as midrange PA 7100-based systems on integer code,
but with smaller caches and a lower CPU frequency.
Superscalar* execution is a classic method of improving
performance at a given frequency. The PA 7100 has
superscalar execution, so much of the control infrastructure
was already in place for our needs. However, the PA 7100 has
only one integer and one floating-point execution unit,
allowing only floating-point code to be accelerated.
Performance goals for the PA 7100LC were focused on integer
applications, so we investigated the possibility of adding a
second integer execution unit for "integer dual issue."

Our aggressive schedule allowed very little time to investi- 
gate the addition of a second integer execution unit. We 
identified three options for the classes of instructions we 
might be able to execute in parallel. For each option we esti- 
mated the cost in engineering time, area, and possible impact 
on our time to market. The benefits of each option were 
predicted using simulation of benchmark instruction traces.

Loads and stores typically represent about 40% of all
instructions executed, so the first option was to split the
existing integer execution unit into one that could do loads
and stores and one that could do everything else. This would
enable us to execute a load or store in parallel with some
other type of instruction. The second option added a full
ALU in the load and store unit, which would also allow two
arithmetic or logical instructions to execute at a time. The
third option added a specialized way to execute two loads
or two stores that happen to be referring to adjacent
memory locations.
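
The three options can be read as increasingly permissive pairing rules for two adjacent instructions. The sketch below models them under simplifying assumptions: the `Instr` record, the three instruction kinds, and the 4-byte definition of "adjacent" are all hypothetical, and the real issue logic handles many cases not shown here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    kind: str                      # "load", "store", or "alu" (simplified)
    dest: Optional[str] = None     # register written, if any
    srcs: Tuple[str, ...] = ()     # registers read
    addr: Optional[int] = None     # byte address for loads and stores


def can_dual_issue(first: Instr, second: Instr, option: int) -> bool:
    """Return True if the pair may issue together under option 1, 2, or 3.

    Options are cumulative: option 2 includes option 1, and so on.
    """
    # The second instruction may not consume the first one's result.
    if first.dest is not None and first.dest in second.srcs:
        return False
    first_is_mem = first.kind in ("load", "store")
    second_is_mem = second.kind in ("load", "store")
    if first_is_mem != second_is_mem:
        return True                # option 1: one memory op plus one other op
    if not first_is_mem:
        return option >= 2         # option 2: two ALU operations
    # Both are memory operations.
    if (option >= 3 and first.kind == second.kind
            and first.addr is not None and second.addr is not None):
        # Option 3: two loads or two stores to adjacent words
        # (adjacency of 4 bytes is an assumption).
        return abs(first.addr - second.addr) == 4
    return False


if __name__ == "__main__":
    load = Instr("load", dest="r1", srcs=("r2",), addr=0x100)
    add = Instr("alu", dest="r3", srcs=("r4", "r5"))
    print(can_dual_issue(load, add, option=1))
```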

The performance of each option was not trivial to estimate.
The benchmarks were compiled for current machines with
one integer unit, so the compiler made no attempt to
schedule instructions in such a way that adjacent
instructions had no data dependencies. This led to lower
performance estimates than we would have expected with an
optimized compiler. The performance lab addressed this
problem by reordering the instructions within each trace
before simulation. The reordering tool scheduled
instructions to avoid functional unit contention and data
dependencies, using a range of assumptions about our future
compiler technology.

Performance improvement was measured on six benchmarks
from SPECint92 and TPC-A. The first option, load and
store plus ALU operation, gained 1% to 7% performance
improvement for the benchmarks using conservative compiler
assumptions and 9% to 23% using optimistic compiler
assumptions. The second option, supporting two ALU
operations in addition to the first option, gained another 1%
to 9% performance improvement. The third option,
additionally supporting two loads or stores, gained another
1% to 3% performance improvement. At the time, the
performance gain for these last two options seemed
discouragingly low.

* CPU architecture that allows the execution of more than
one instruction in a single clock cycle.

The cost of the first option was estimated at about one
engineering year of effort and 3 mm² of area. The second
full ALU would add a few more months of effort and less than
1 mm² of area. The double load and store option would add
a few more months of effort but no significant area. Perhaps
the greatest cost factor was schedule risk because of
increased complexity. Functional bugs late in the design
cycle can affect time to market, and similar functionality
issues had been a source of bugs on other chips. However,
experience gained with the superscalar PA 7100 design made
us confident that adding integer dual issue would not limit
our schedule.

Ultimately, concern about our competition led us to
implement all three options. Also, while the performance
improvement estimates on SPECint92 might seem small, some
highly tuned applications can derive enormous benefit. One
example is the software MPEG video decoder described in the
article on page 60. The HP 9000 Model 712 can display MPEG
video with stereo audio at full frame rate without
special-purpose hardware, and a significant part of this
achievement comes from the PA 7100LC executing two integer
ALU instructions at a time.

Architectural Enhancements 

We added three new architectural features to the PA 7100LC
implementation: little-endian addressing, uncachable
memory pages, and multimedia instructions. The first two
features are present in several of today's microprocessors
and represent the evolution of modern RISC architectures.
Little-endian addressing allows for more efficient execution
of code compiled for other platforms and enables the use of
new multivendor operating systems such as Windows NT.
Uncachable memory pages increase the efficiency of code
sharing cache lines between the processor and I/O and are a
less expensive solution than implementing systems with
coherent I/O.

The multimedia features are more specific to the PA 7100LC.
In late 1991, HP created a multidivisional team of hardware,
software, and architecture experts responsible for creating
the technologies that would enable a low-cost workstation
to be multimedia capable without the cost of dedicated
multimedia hardware. At that time, many standards for
video compression were emerging. Of these, JPEG (Joint
Photographic Experts Group) and MPEG (Moving Picture
Experts Group) looked most promising for still-frame and
full-motion video respectively. Since workstations serve as
decode-only clients in most environments, the team decided
to focus on building an efficient decompression engine,
while leaving the more complex task of video compression
to be done offline or by high-end servers.

Initial experiments with JPEG and MPEG performance were
done using public domain software running on an HP 9000
Model 720 workstation. Even after extensive algorithm
changes and software enhancements, the performance was
still far below the ultimate goal of real-time video at 30
frames/s. One time-intensive component of the encode and
decode algorithms is the discrete cosine transform (DCT).
The DCT requires a large number of multiplies and adds,
weighted differently depending on the algorithm. Since
PA-RISC directly supports multiply instructions in the
floating-point unit but not in the integer unit, we initially
used floating-point arithmetic for the DCT and found
algorithms that could take full advantage of the
multi-operation FMPYADD (floating-point multiply and add)
instruction.
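
To show why the DCT is dominated by multiplies and adds, here is a direct (unoptimized) one-dimensional DCT in Python that also counts its multiply-add pairs. This is the textbook DCT-II definition, assumed here for illustration; the fast algorithms actually used in the decoder need far fewer operations and are not reproduced.

```python
import math


def dct_1d(samples):
    """Direct DCT-II of a list of samples.

    Returns (coefficients, multiply_adds), where multiply_adds counts the
    multiply-accumulate pairs in the inner loop.
    """
    n = len(samples)
    coefficients = []
    multiply_adds = 0
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = 0.0
        for i, sample in enumerate(samples):
            # One multiply and one add per input sample per output term.
            acc += sample * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            multiply_adds += 1
        coefficients.append(scale * acc)
    return coefficients, multiply_adds


if __name__ == "__main__":
    coeffs, ops = dct_1d([1.0] * 8)
    print(ops)  # multiply-add pairs for one 8-point transform
```

An 8-by-8 block needs sixteen such 8-point transforms (eight rows and eight columns), so a pairing of each multiply with an add in one instruction maps naturally onto this loop.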

While the floating-point unit was efficient at providing a 
multiply and an add in a single cycle, it was inefficient at 
packing and unpacking data, normalizing results, and satu- 
rating results to maximum or minimum values. Thus, we 
found that a lot of time was spent converting values between 
integer and floating-point representations to accomplish both 
the multiply-adds and the data manipulations. To eliminate 
the conversions, we investigated the possibility of adding a 
multiplier to the integer data path but found the area require- 
ments to be prohibitive for a low-latency, 16-bit or 32-bit 
multiplier. Given that JPEG and MPEG operate on 8-bit data, 
building an 8-bit multiplier might have been feasible but 
extra instructions for normalization of intermediate results 
would have been required. 

PA-RISC has always provided shift-and-add instructions as
primitives for software emulation of integer multiplication.
These instructions shift a register value left by one, two,
or three bits and add the result to a second register value.
Using these instructions, multimedia software can multiply a
16-bit value by an 8-bit constant with a sequence of one to
three instructions. We found that by picking a DCT that was
biased away from multiplications in favor of additions, the
shift-and-add instructions provided good performance
compared to the other options mentioned above. The deciding
factor, though, was the ability to add parallelism to the
shift-and-add instructions along with the normal adds.
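
As an illustration of multiplication by a constant with shift-and-add primitives, the sketch below models one such primitive in Python and chains it to multiply by 5, 10, and 45. The helper name `shadd` and the chosen constants are assumptions for this example, not code from the decoder.

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit register


def shadd(shift: int, a: int, b: int) -> int:
    """Model of a shift-and-add primitive: (a << shift) + b, shift in 1..3."""
    if shift not in (1, 2, 3):
        raise ValueError("shift must be 1, 2, or 3")
    return ((a << shift) + b) & MASK32


def times_5(x: int) -> int:
    return shadd(2, x, x)        # 4x + x


def times_10(x: int) -> int:
    t = shadd(2, x, x)           # 5x
    return shadd(1, t, 0)        # 2 * 5x


def times_45(x: int) -> int:
    t = shadd(3, x, x)           # 8x + x = 9x
    return shadd(2, t, t)        # 4 * 9x + 9x = 45x


if __name__ == "__main__":
    print(times_5(7), times_10(7), times_45(7))
```

Each constant here costs one or two primitive operations, consistent with the one-to-three-instruction sequences mentioned above.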

As mentioned above, JPEG and MPEG operate on 8-bit data
and it is convenient to store intermediate results as 16-bit
values. Thus, it seemed reasonable to split the 32-bit data
paths of the ALUs to achieve two 16-bit operations per ALU
per cycle. With a slight redesign of the integer ALU, it was
possible to break the carry chain, force carry-ins as
necessary, and allow for proper preshifting of both 16-bit
values packed in the 32-bit registers. We also changed the
preshifter to allow the shift-and-add operations to support
division by allowing for right-shifts as well as left-shifts.
Given the PA 7100LC's dual ALU design, these hardware
changes allowed us to achieve four 16-bit adds, subtracts, or
shift-and-adds per cycle. This brought us closer to our
design goal of 30 frames/s for video decode, but more work
was needed.

The next step was to add saturation logic to the ALU. When
adding pixel or audio values, it is often desirable to "clip"
the result to the smallest or largest possible value as a
result of underflow or overflow, respectively. (This is
called arithmetic saturation.) By specifying a completer* to
the new 16-bit instructions, the hardware can be set to
saturate the result automatically using either signed or
unsigned arithmetic. We also added an instruction to
calculate an average by adding two registers and shifting
the result right by one bit. Averaging is used in MPEG and
other algorithms to interpolate between two values.

* A completer is a part of the instruction mnemonic
specifying an option; for example, in ldw,m the completer m
specifies address modification for the load word (ldw)
instruction.
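
The behavior of the split ALU can be modeled in a few lines of Python. The sketch treats a 32-bit register as two independent 16-bit halves and shows wraparound, signed saturation, unsigned saturation, and averaging. It is a simplified model: the function names are hypothetical, and details of the real instructions (operand signedness rules, rounding in the average) are not taken from the article.

```python
def _halves(word: int):
    """Split a 32-bit word into (upper, lower) 16-bit halves."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def _signed16(half: int) -> int:
    return half - 0x10000 if half & 0x8000 else half


def packed_add(a: int, b: int, saturate=None) -> int:
    """Add two packed pairs of 16-bit values.

    saturate is None (wrap around), "signed", or "unsigned".
    """
    result = 0
    for ha, hb in zip(_halves(a), _halves(b)):
        if saturate == "signed":
            total = max(-0x8000, min(0x7FFF, _signed16(ha) + _signed16(hb)))
        elif saturate == "unsigned":
            total = min(0xFFFF, ha + hb)
        else:
            # Broken carry chain: the carry out of bit 15 is discarded
            # instead of propagating into the upper half.
            total = ha + hb
        result = (result << 16) | (total & 0xFFFF)
    return result


def packed_average(a: int, b: int) -> int:
    """Average each pair of unsigned 16-bit halves: add, then shift right by one."""
    result = 0
    for ha, hb in zip(_halves(a), _halves(b)):
        result = (result << 16) | ((ha + hb) >> 1)
    return result


if __name__ == "__main__":
    print(hex(packed_add(0x7FFF0001, 0x00010001, "signed")))
```

With two ALUs each operating on two halves, four such 16-bit results are produced per cycle, which is the parallelism described above.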

Once again, these new features were incremental changes to
the integer ALU design, resulting in very little area
overhead and no critical speed paths. Using these new
features, an 80-MHz PA 7100LC can achieve MPEG
decompression rates of 30 frames/s with no sound using CIF
(352 by 240) resolution. With full stereo sound, a rate of
25 frames/s can be achieved. The PA 7100LC is the first
processor capable of achieving these rates without the added
expense of dedicated multimedia hardware. The article on
page 60 describes these multimedia features in more detail.

Conclusion 

Correctly deciding which features should (and should not) 
be included in a product is fundamental to the product's 
success. Design decisions are often strongly connected and 
often require appropriately crafted supporting design meth- 
odologies. Processor designers must make design decisions 
in areas such as package technology, degree of integration, 
cache organization, number of execution units, pipeline or- 
ganization, and floating-point functionality. With the 
PA 7100LC processor, Hewlett-Packard has demonstrated an
ability to make design decisions in a manner that leads to
products having a strong competitive position in the areas of
cost, performance, quality, and time to market.

Design Methodologies for the
PA 7100LC Microprocessor

Product features provided in the PA 7100LC are strongly connected to the 
methodologies developed to synthesize, place and route, simulate, verify, 
and test the processor chip. 

by Mick Bass, Terry W. Blanchard, D. Douglas Josephson, Duncan Weir, and Daniel L. Halperin



Engineers who wish to create a leading-edge product with
competitive performance, features, cost, and time to market
are often challenged to create design methodologies that
will enable them to succeed in their task. Decisions about
the features of a product usually have an inseparable impact
on the methodologies used to create, verify, debug, and test
the product.

During the development of the PA 7100LC microprocessor,¹
engineers crafted several methodologies that supported the
design decisions that were made throughout the project
and provided the framework for implementing the design
decisions.

This article explores several of these methodologies. For
each methodology, we discuss the design decisions that
impacted the methodology, the alternatives that we
considered, and the course that we chose. We discuss the
results produced by each methodology, as well as problems
that we encountered and overcame during each methodology's
development and use.

Some of the design decisions that motivated us to develop
new design methodologies for the PA 7100LC are discussed
in the article on page 12. The areas in which we developed
these methodologies include control synthesis, place and
route, production test, processor diagnosability, presilicon
verification, and postsilicon verification.

The resultant methodologies were crucial to our ability to
meet the design goals that we had set for the PA 7100LC.
Taken together, they enabled good decisions leading to a
successful product implementation.

Synthesis and Routing Methodology 

The control circuits in any microprocessor typically represent
a major portion of the complexity of the chip. The control
circuits of the chip contain most of the chip's intelligence. It
is these circuits that direct the rest of the components on the
chip. The operation of the control circuits is similar to the
way operators of complex machines on a factory floor
control the way that those machines behave.

Blocks of control circuitry perform similar jobs, and the
nature of these jobs determines the nature of the control
blocks themselves. Control blocks typically implement logic
equations, the outputs of which control some other function
present on the chip. The logic equations implemented by
control blocks tend to be irregular and loosely structured. A
necessary characteristic of any control block is for its
outputs to become valid in sufficient time to control its
downstream circuits properly. Like other portions of the
chip, control blocks can have timing paths that limit the
overall chip operating frequency if the blocks are not
carefully designed and implemented.

Another characteristic of blocks that implement control
logic is that they change frequently throughout the design
process. Experience has shown that a vast majority of bugs
are found in the control blocks, probably because so much
of the chip complexity resides there. We have found that it is
very likely that the last bugs fixed before a chip design is
sent to manufacturing will be in these blocks.

When we were defining the methodology for implementing
the control circuitry for the PA 7100LC, we considered these
general characteristics, as well as specific new requirements
that stemmed from our design goals for the project. The PA
7100LC had new requirements, compared to earlier CPUs, in
the areas of low power dissipation and support of IDDQ
testing. We knew that the PA 7100LC control would be even
more complex than past CPUs because of its high level of
integration and its superscalar design. To make it easy to
accommodate this new functionality, we wanted to be able to
make the control blocks as small and as flexibly shaped as
possible. Finally, since we were leveraging the design of the
PA 7100LC processor from the PA 7100 processor,³ we
wanted to leverage control equations or control circuitry
from the past design for many of the blocks.

The control of the PA 7100, from which we were leveraging,
is primarily implemented as a programmable logic array
(PLA). Programmable logic arrays have very regular
physical and timing characteristics. The PLA architecture
used in the PA 7100 involves dynamically precharged and
pseudo-NMOS circuits. The outputs of this PLA become true at
least one CPU state after its inputs become valid. The PLA
latches all inputs with respect to a specific fixed clock edge.

PLA Methodology. The methodology used to design PLAs for
the PA 7100 was well developed, as were the tools that were
necessary to support it. PLAs were designed in a high-level
language with a syntax reminiscent of the Pascal
programming language. In-house tools were available to
translate the high-level source language to optimized
Boolean sum-of-products equations. Other in-house tools were
available to use these sum-of-products equations to generate
the PLA artwork (including programming the array).
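
For readers unfamiliar with the sum-of-products form, the following Python sketch evaluates a tiny PLA-style description: an AND plane of product terms followed by an OR plane that combines them into outputs. The data layout and the signal names (`is_divide`, `fp_busy`, and so on) are invented for this example and do not come from the PA 7100 control equations or the in-house tools.

```python
def evaluate_pla(product_terms, outputs, inputs):
    """Evaluate sum-of-products logic.

    product_terms: list of dicts mapping an input name to its required value
                   (inputs not listed in a term are "don't care").
    outputs:       dict mapping an output name to the indices of the product
                   terms that are ORed together to form it.
    inputs:        dict mapping each input name to True or False.
    """
    # AND plane: a term is true when every listed input has the required value.
    term_values = [
        all(inputs[name] == wanted for name, wanted in term.items())
        for term in product_terms
    ]
    # OR plane: an output is true when any of its product terms is true.
    return {
        name: any(term_values[index] for index in indices)
        for name, indices in outputs.items()
    }


# Hypothetical control equation:
#   stall = (is_divide AND fp_busy) OR (is_load AND cache_miss)
TERMS = [
    {"is_divide": True, "fp_busy": True},
    {"is_load": True, "cache_miss": True},
]
OUTPUTS = {"stall": [0, 1]}

if __name__ == "__main__":
    state = {"is_divide": False, "fp_busy": True,
             "is_load": True, "cache_miss": True}
    print(evaluate_pla(TERMS, OUTPUTS, state))
```

The regular two-plane structure is what makes PLA artwork easy to generate automatically, and also what fixes its shape and timing.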

When the destination circuits could not tolerate the
one-state delay required by the PLA core, we created
schematics for handcrafted standard-cell blocks that could
calculate their outputs in the required time. We then used an
in-house channel router to create artwork for the
standard-cell blocks.

The PA 7100 PLA methodology had several advantages. The
PLA design and implementation tools were simple and
well-understood. They provided a turnkey artwork generation
solution from the high-level control equations, which made
it easy to accommodate late changes. Most important, we
already had a high investment in this methodology. We
understood it very well, had all the required tools in place,
and knew we wouldn't find any surprises.

However, when considered in light of the requirements of
the PA 7100LC, the PLA methodology had several
disadvantages. Although the physical structure of a PLA is
fixed and very regular, its fixed shape would lead to
difficulty in floor planning for a chip as highly integrated
as the PA 7100LC. We also knew that PLA implementations of
control logic do not yield optimal circuits with respect to
absolute size. PLA cores involve both precharged logic and
pseudo-NMOS logic, leading to high power dissipation
relative to fully static circuits. PLA circuits are also
incompatible with our IDDQ test methodology, which is
described later in this article. Although PLAs can usually
guarantee a one-state delay from input to output, their
timing is inflexible. The addition of hand-designed
standard-cell blocks to address this problem is not only
labor-intensive, but also adds complexity to the overall
solution and increases the probability of introducing bugs
in these areas. Also, some types of control logic cannot be
represented compactly in the sum-of-products form required
by the PLA methodology. This logic must then either be moved
into a standard-cell block or redesigned.

New Methodology. Since the disadvantages of the PLA
methodology would compromise our ability to achieve our
design goals, we began to investigate alternatives. We had
some positive experience with using Synopsys, a commercial
synthesis tool, on the floating-point control block of the PA
7100. We began to investigate the potential impact of
combining automated synthesis using Synopsys with an
over-the-cell router.† Our investigation of the synthesize
and route methodology pointed out the following advantages
and disadvantages:

• The absolute size of the blocks produced would be smaller
than the blocks produced using either PLAs or
channel-routed blocks. Additionally, the floor plan would be
more flexible than that produced by a PLA, allowing us to
partition the controller so that we could create control
blocks that fit into available area close to the circuits they
must control.

• We would have to pay more attention to timing because we
would no longer have the regular timing structure of the
PLA to guarantee that state budgets would be satisfied.

• The circuits produced would dissipate less power than
corresponding PLA implementations because the synthesize
and route methodology uses fully static circuitry. The circuits
would also be IDDQ compatible.

• We would have to design a new library of standard cells that
would be compatible with the over-the-cell router. We would
also need to design a new set of drivers that would drive
output signals from the standard-cell core to the rest of the
chip and that would be compatible with our production test
design rules. These tasks were very well-defined and we
understood the effort that would be required to complete
them.

• Of greater concern was the realization that the synthesis
path from the input equations to completed artwork would
be more complex than the corresponding path in the PLA
methodology and would be almost completely new.

† Over-the-cell routers place and route cells so that there is
less need to provide routing channels between the cells.

With the PLA methodology, we knew that there would be no
surprises. Incorporating this new technology would remove
much of that certainty. However, the benefits clearly
outweighed the costs. We felt that we couldn't afford to
compromise our power, area, timing, and test goals by
continuing with the PLA methodology.

We overcame several issues while making the new
methodology work for us. We leveraged the source code of
many of the control blocks from the PA 7100, all of which
were specified in the PLA source language. We were able to
leverage existing PLA sources directly by using the PLA tools
to generate sum-of-products equations in a form that the
Synopsys synthesis tool could understand. Synopsys was then
free to massage the equations into a more optimal form.
Source code development of these leveraged control blocks
continued using the PLA source language, even though we
were using the new methodology for synthesis and route. We
developed control blocks that were new for the PA 7100LC
using the Verilog behavioral description language, which has
a more direct input path to Synopsys.

We chose the Cell3 router from Cadence Systems Inc. to
perform the place and route portion of our new
methodology. The main issue remaining was how to integrate
this new tool with our other tools. To minimize the number of
costly licenses we needed to purchase and to maximize the
block designers' productivity, we decided to use our existing
artwork editor as a front end to the router's floor planning
capability. This approach allowed designers to preplace
critical cells, power nets, and clock nets easily. We
developed new tools that would translate this floor plan into
a form that the Cell3 router could understand. While these
techniques maximized designer productivity and minimized
license cost, we found that it was sometimes difficult to
isolate bugs in the methodology to either our front-end tools
or to the Cell3 router itself.

We also discovered that the timing capabilities of the version
of Synopsys that we used were less robust than we had
believed at the beginning of the project. This discovery had 
only a minimal impact on blocks that were leveraged from 
PLAs because of the regularity in the timing of those blocks. 
However, to ensure robust timing on the remaining blocks, 
we needed to develop new tools. The need for these unan- 
ticipated workaround tools had a negative impact on our 
schedule. 

As with PLAs, we also found that certain types of circuits do
not map well to the synthesize, place, and route methodology.
On a large block where we made much use of the timing
flexibility offered by static standard cells, we found that our
synthesis tools were sometimes unable to produce circuits
that met the timing and area constraints of the block.
Whenever this occurred, we had to redesign the control
source so that the synthesized circuits could meet their
physical requirements, or help the tools by hand-designing
portions of the circuit.

Fig. 1. A simplified block diagram of the PA 7100LC showing
the relationship between the control blocks and the other
major blocks in the processor. The instruction execution and
pipeline sequencing control block consists of four separate
blocks that are physically distinct but highly
interconnected. Not all of the control connections on the
PA 7100LC are shown in this figure.

We found that on some of the standard-cell blocks leveraged
from the PA 7100, the synthesis tools had difficulty creating
circuits that performed as well as their PA 7100
counterparts. This difficulty was caused in part by
differences in the standard-cell libraries for the two chips.
The PA 7100LC library had no pseudo-NMOS circuits, which
were used quite effectively to meet timing on the PA 7100 (at
the expense of higher power dissipation). The rest of the
difference lies in the fact that, for all its sophistication,
automated synthesis is still no match for carefully
hand-designed blocks. Fortunately, our design tools allowed
us to hand-design portions of the block while synthesizing
the rest of the block. Although time-consuming, we chose
this approach in cases where the tool path was unable to
provide a satisfactory solution.

The overall results of the methodology we chose were good.
We were able to partition the PA 7100LC's control
functionality into seven primary control blocks. Four of the
blocks control the sequencing and execution of instructions
by the pipeline. The remaining three control blocks control
the memory and I/O subsystem, the cache subsystem, and the
floating-point coprocessor (see Fig. 1). Together, these
seven blocks represent only a small percentage of the total
die area, and implement nearly all of the control algorithms
and protocols used by the PA 7100LC.

Even though the PA 7100LC adds integer superscalar
execution and a memory and I/O controller compared to the PA
7100, the area of the control core produced by the new
methodology is about half the area of the PLA core of the PA
7100. The area occupied by the driver stacks in the control
blocks on the two chips is about the same.

The new methodology implemented all of the control blocks
correctly and introduced no functional bugs. The timing
methodology that we had in place by the end of the project
was very effective at identifying problem timing paths
before they made it onto silicon. When we received chips from
manufacturing, we found no problem timing paths in any of
the control blocks that were created using the new
methodology.

Verification Methodology 

One of the most prominent design goals for the PA 7100LC
was to meet the schedule required to enable a very steep
production ramp. This goal, coupled with Hewlett-Packard's
commitment to quality, meant that we needed to have in
place a solid plan to verify the correctness of the chip at all
stages of its design.

Our design goals and the knowledge that the PA 7100LC
was to be the most highly integrated CPU that HP had ever
created led us to focus early on the methodology that we
would use to verify the chip. As shown in Fig. 2, our
verification methodology included several distinct forms of
verification, some of which occur before silicon is
manufactured (presilicon verification) and some of which
occur after first silicon appears (postsilicon verification).

Fig. 2. Overview of the functional verification process.

Presilicon verification activities included: 

• Creating software behavioral models through which we
could verify the correctness of either the entire design or
portions of it

• Creating switch-level models of the implementation to 
ensure that the implementation matched the design 

• Writing test cases that provided thorough functional 
coverage for each of these models 

• Using in-circuit emulation to increase vector throughput and 
to provide an orthogonal check of the chip's correctness. 

Postsilicon verification activities included:

• Augmenting functional coverage by running hand-generated 
test cases, randomized test cases, and application software 

• Testing actual silicon against its electrical specification 
using a rigorous electrical testing procedure. 

We designed each portion of our verification methodology to 
ensure that we could meet our schedule and quality goals. 
The following sections describe in more detail the types of 
verification we used.

A New Strategy 

At the time work was starting on the development of the 
PA 7100LC chip, HP was moving toward a new product 
development philosophy, which had as its basis the fact that 
HP could no longer afford to do everything for itself. The time 
had come to specialize in core competencies and look to 
outside vendors to cover the needs common in the industry.
Unless HP provided a clear competitive advantage over
industry-standard tools and methods, design teams were
encouraged to adopt these standards, paying others to develop
and maintain leading-edge tools and processes.



During the PA 7100LC investigation phase, engineers
investigated industry-standard tools in the areas of behavioral
simulation, static timing analysis, fault grading, timing
verification, switch-level simulation, and other areas of chip
verification. The first and foremost goal of these
investigations was to determine which tools provided the fastest
and most efficient contribution toward design and
verification, ultimately leading to earlier products. The
following section provides an analysis of our behavioral
simulator selection, which is just one example of the many
tool decisions we made for the PA 7100LC.

Behavioral Simulation. Before the PA 7100LC development 
effort, we had been using a proprietary simulator which was 
written and maintained by an internal tools group. With the 
standardization of simulation languages in the industry, we 
questioned the value of high internal development and main- 
tenance costs for this tool. We investigated the language and 
simulator options available in the industry and eventually 
reached a final list of choices: 

• The proprietary HP solution 

• Verilog 

• VHDL (IEEE standard 1076). 

Other HP design labs, responsible for graphics and IC
hardware design, had migrated to Verilog from the HP simulator
and had found significant improvements in simulation
throughput on their ASIC designs. The throughput
disadvantage of the HP simulator was somewhat balanced by the fact
that it carried no licensing fees, was fully robust, and had 
been proven capable of simulating a large, custom IC design 
such as a CPU. 

Verilog had become a de facto standard in the U.S. for high- 
level and gate-level simulation in 1992 and had been used 
extensively in HP's graphics hardware and IC design labs. 
Their experience indicated that Verilog was very robust and 
that it allowed personalized extensions through linking with 
C code. The IC design lab demonstrated simulation speeds 
with Verilog that were about seven times faster than the
internal HP simulator. Since Verilog was becoming more
common within HP, it would ease our task of sharing and
combining simulation models with design partners. For
example, the floating-point circuits that we would be
leveraging from the PA 7100 for the PA 7100LC were modeled in
Verilog. The graphics chip and the LASI chip used in the
Model 712 workstation were being developed using Verilog,
and many of the commercial ICs used in the system had
Verilog models available for system simulation. By choosing
Verilog, we would create a homogeneous environment. We
also felt that Verilog's C-like syntax would allow engineers
to learn the language quickly. Finally, the Verilog language
would provide a bridge to other useful industry-standard
tools for static timing, fault grading, and synthesis.

At the time we were investigating simulators we found only
one supplier who could provide a mature Verilog simulator
in our required time frame. This particular simulator had
some disadvantages compared to our internal simulator,
which included higher main memory requirements and the
need to recompile the simulation model at each invocation
of the simulator. For large models, this compile phase could
last a full minute. The internal simulator, by contrast,
compiled the model once into an executable program which
contained the simulation engine, and incurred no run-time
startup penalty. Also, because Verilog was licensed we
would have to purchase sufficient licenses to cover our
simulation needs, which would present a large initial expense.

A third major simulation language we investigated was VHDL
(IEEE Standard 1076). While Verilog was becoming a de facto
standard in the United States, VHDL was sweeping Europe.
VHDL shared many advantages and disadvantages with
Verilog. Simulation models of commercial system chips
were often available in both languages. VHDL provided
hooks to support industry-standard tools for timing, fault
grading, synthesis, and hardware acceleration. VHDL was
also licensed and would be expensive. The primary
differentiator between VHDL and Verilog was in ease of use and
ease of learning. Other HP design labs indicated that VHDL
was more difficult to learn and use than Verilog. Also, there
was no local expertise in VHDL, while proficiency in Verilog
had been growing, and significant inroads had already been
made at integrating Verilog into the remainder of our tool
set.

With this information in mind, the PA 7100LC technical team
decided to use Verilog as the modeling language for the PA
7100LC processor. The compelling motivations for this
choice were:

• The demonstrated success of other HP labs in using the
Verilog simulator in ASIC designs

• The availability of local expertise and support for the
simulator and modeling language

• The ability to standardize on a single simulator and
modeling language for the development of all custom VLSI used in
the HP 9000 Model 712

• The ability to interface easily to other industry-standard
tools.



Given this decision, we joined an effort with other design
labs to enhance the Verilog simulator to include an
improved user interface and more tool interfaces to be used
throughout our verification effort.

Turn-on Process. We migrated to the Verilog modeling
language and simulator in two steps. First, we validated that
Verilog could simulate an existing PA-RISC design of
comparable complexity to the PA 7100LC by converting the PA
7100 simulation model (from which the PA 7100LC design is
leveraged) into Verilog. Second, we used the knowledge that
we gained during this conversion process to complete the
development of the PA 7100LC.

Converting the PA 7100 simulation model into Verilog was a
good decision for several reasons. We wanted to start with a
known functional model from which we could leverage. We
also needed to confirm that Verilog was robust and accurate
enough to model a design as large and complex as a CPU.
The PA 7100 offered a hierarchical, semicustom design
model that consisted of high-level behavioral blocks (e.g.,
the translation lookaside buffer) and FET descriptions (e.g.,
in custom leaf cells). This varied design would provide a
good test of the simulator's ability and would help us to
learn about Verilog's unique requirements.

To aid the conversion process, we created a tool that
converted the HP proprietary modeling language to Verilog
syntax. We fixed code by hand wherever the two languages did
not have similar constructs or where they evaluated similar
constructs differently. The converted model passed its first
test case within two months.

Once the PA 7100 model was up and running in Verilog, we
measured its simulation throughput. Instead of the expected
7x speedup, we discovered a full 4x slowdown compared to
the HP simulator. We also found that the model consumed
more memory than we had anticipated. Through careful
analysis and support from our supplier, we learned that
much of our model syntax was very inefficient. In addition
to inefficiencies created by the translation tools, many
syntax structures that were optimum in HP's simulator were
nonoptimal in Verilog. Profiling and correcting these
inefficiencies greatly improved performance and resource
requirements.

Results. The result of the decision to use Verilog to model
the PA 7100LC was positive, with a few disappointments.
The main disappointment was that the Verilog model of the
PA 7100LC achieved only parity in throughput and required
five times more memory than the HP simulator.

However, Verilog brought strengths in other areas. Verilog
allowed us to make incremental changes to the model
quickly and easily. Verilog enabled us to capitalize on
industry-standard tools in the areas of synthesis, timing, fault
grading, and in-circuit emulation. We were able to use a
single modeling language across all of the custom
components in the HP 9000 Model 712 workstation and to obtain
compatible models for many of the external components.

We soon learned to use the new strengths provided by Verilog
and became efficient in using the language and the new
simulator. Verilog successfully modeled all constructs required
in the PA 7100LC design, and a high level of quality was the
end result of using this tool.

Presilicon Functional Verification 

Because the cost and lead time of manufacturing CPU die are
so great and because our system partners depend on fully
functional first silicon to meet their schedule goals, it is
important that our presilicon verification methodology give us
high confidence in the functional quality of the first silicon.
This task proved to be a challenge for the PA 7100LC chip
because it was designed by many engineers, and its feature
set is extensive and complex. These factors introduced the
opportunity for design and implementation bugs.

The PA 7100LC is the first HP processor chip to integrate the 
memory and I/O controller on the same die as the CPU. In the
past, these designs lived on separate die and were owned by 
separate project teams. The verification efforts for the two 
designs were mostly independent. A careful specification of 
the interface between the two designs allowed this approach 
to succeed. 

We realized that even though the PA 7100LC would integrate 
the memory and I/O controller onto the CPU die, it would be 
more effective to verify the memory and I/O controller sepa- 
rately from the CPU core for the majority of the tests. This 
would allow test cases for both the CPU and the memory and 
I/O controller to be more focused, smaller, and faster to sim- 
ulate than they would be in a combined model. We created a 
well-defined interface between the CPU and memory and 
I/O controller to enable this approach. 

Each of these presilicon verification efforts was structured 
as shown in Fig. 2. First we created a behavioral model for 
the portion of the design whose function was to be verified.
A behavioral model represents the design at some level of
abstraction and typically moves from very high-level to
much more specific as the project progresses. As mentioned 
above, we chose Verilog as the modeling language for our 
design. 

The behavioral model was the heart of the simulation
environment that would enable us to verify the CPU and the
memory and I/O controller. Our job was to find deficiencies
in this model. However, to do this we needed a way to stim- 
ulate the model, observe its results, and ensure that its be- 
havior was correct. To meet these needs, we created addi- 
tional software objects to complete the simulation 
environment. 

At each of the external interfaces of the behavioral model, 
we created custom code that was capable of modeling the 
behavior of the device on the other side of the interface and 
of stimulating and responding to the interface as appropriate 
for that device. For example, these stimulus-generating soft- 
ware objects were used in our simulation environment in the 
same way that dynamic RAM, external cache, and I/O devices 
are used in a physical system. We authored the code that 
models these objects in a high-level language (typically C). 



Another type of custom software that augments the simula- 
tion environment consists of checkers. A checker monitors 
the behavioral model and checks aspects of model behavior 
for correctness. We used a number of different checkers 
during the PA 7100LC verification effort. Some checkers
were very focused (e.g., a protocol checker on the I/O bus),
and others were more global (e.g., the PA-RISC architectural
simulator). 

Creating "watchdog" pieces of code to detect and signal
errors automatically in the simulation environment helped
us to maintain our schedule. Previous CPUs had an
independent model of the design that matched the behavioral model
state-by-state for all external pads and architected internal
state.* Creating the independent model was time-consuming
and not easily broken into small pieces that could be 
worked on in parallel. We couldn't run test cases on the be- 
havioral model without a fully functional independent 
model. Replacing this independent model with a collection 
of checkers allowed us to create multiple checkers at the 
same time. We were able to turn on the checkers
independently as the functionality that they checked became
available in the behavioral model. Also, the checkers didn't need
to be fully functional for us to run useful test cases. 

The final aspect of the simulation environment is the test 
case. A test case provides initialization to the model and the 
stimulus generating software objects and then orchestrates 
the stimulus generators to provide external stimulus while 
the model is simulating. The checkers constantly watch
model behavior and identify rules that the model violates.
The test cases are not self-checking. They simply stimulate 
the model and rely on the checkers to ensure that the model 
responds correctly. 
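
The division of labor described above (a model, a stimulus stream, and independently enabled checkers, with test cases that are not self-checking) can be illustrated with a small Python sketch. It is not the PA 7100LC environment: the toy counter model, the two checker rules, and all names (ToyModel, increment_checker, hold_checker, run_test_case) are hypothetical and exist only to show the structure.

```python
from typing import Callable, Dict, List, Optional

Signals = Dict[str, int]
# A checker looks at the previous and current signal values and returns
# an error message when its rule is violated, or None otherwise.
Checker = Callable[[Signals, Signals], Optional[str]]


class ToyModel:
    """Hypothetical stand-in for a behavioral model: an 8-bit counter."""

    def __init__(self, buggy: bool = False) -> None:
        self.count = 0
        self.buggy = buggy

    def step(self, enable: int) -> Signals:
        if enable:
            # Injected flaw (assumption for the demo): skips a value at 3.
            inc = 2 if (self.buggy and self.count == 3) else 1
            self.count = (self.count + inc) & 0xFF
        return {"enable": enable, "count": self.count}


def increment_checker(prev: Signals, cur: Signals) -> Optional[str]:
    """Rule: when enable is high, count advances by exactly one."""
    if cur["enable"] and cur["count"] != (prev["count"] + 1) & 0xFF:
        return f"count jumped from {prev['count']} to {cur['count']}"
    return None


def hold_checker(prev: Signals, cur: Signals) -> Optional[str]:
    """Rule: when enable is low, count must not change."""
    if not cur["enable"] and cur["count"] != prev["count"]:
        return f"count changed from {prev['count']} to {cur['count']}"
    return None


def run_test_case(model: ToyModel, stimulus: List[int],
                  checkers: List[Checker]) -> List[str]:
    """The test case only supplies stimulus; the checkers judge behavior."""
    errors: List[str] = []
    prev: Signals = {"enable": 0, "count": model.count}
    for cycle, enable in enumerate(stimulus):
        cur = model.step(enable)
        for check in checkers:
            message = check(prev, cur)
            if message:
                errors.append(f"cycle {cycle}: {check.__name__}: {message}")
        prev = cur
    return errors
```

Because each checker is a separate function, a test case can run with only the checkers that are ready, which mirrors the point above that checkers did not all need to be functional before useful test cases could run.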

We wanted the test cases to create the complex interactions
in the CPU core and in the memory and I/O controller that
are necessary to find subtle bugs. The model, stimulus
generators, and checkers provide an environment that makes it
easy to generate short, powerful test cases. To improve test
case coverage, we gave the responsibility for test case
creation to both the CPU and the memory and I/O controller
designers, who had a detailed knowledge of the internal
operation of the chip, as well as to independent verification
engineers, who knew only the external functional
specification of the chip. We used design reviews to ensure that our
suite of test cases adequately covered all functionality
present in the design.

Testing on the behavioral model is the first line of defense 
against flaws in a design. To ensure that our implementation 
matched the design, we ran our full suite of test cases on a 
gate-level behavioral model. We created this model from the 
complete chip schematics. We also tested a switch-level
model that we created by extracting the FET netlist from
the completed chip artwork. Since this was the same
artwork that manufacturing would use to fabricate the chip,
this regression served as a final test of the functional cor- 
rectness of both design and implementation. 

*In this case, architected state refers to a particular pattern of ones and zeros
on internal chip nodes.

To ensure that there were no coverage holes in the interface
between the CPU and the memory and I/O controller, we
created a model that merged these two designs into a single
behavioral model of the entire chip. We tested this model to
gain certainty that both parts would work properly together.

Finally, we combined behavioral models of the PA 7100LC 
with behavioral models of other chips in the system and 
performed system-level verification to ensure that each of 
the chips interpreted the interchip interfaces consistently 
and to ensure that all the chips in the system functioned as 
expected. 

Using this extensive verification methodology, the first silicon 
we delivered allowed us to boot the HP-UX* operating system 
and enabled our system partners to progress towards meeting 
their system schedules. 

Postsilicon Functional Verification 

Presilicon verification, while providing an excellent first 
pass at ferreting out design or implementation flaws, is not
capable of identifying all bugs in a complex custom CPU 
such as the PA 7100LC. Two factors make this true. First, the 
simulation speeds of even high-level behavioral models (typ- 
ically less than 10 Hz) are not sufficient to exercise all the 
interesting state transitions within the CPU in the time avail- 
able. Second, experience has shown that in a chip of this 
type there are sometimes subtle differences between the 
presilicon model and actual chip behavior. 

To ensure a quality CPU design, we performed extensive
postsilicon testing on the PA 7100LC in systems running at
actual processor speeds (up to 100 MHz). The difference of
about seven orders of magnitude in vector throughput
between running test cases on presilicon models and code
running on actual silicon underscores the potential for
thorough testing offered by postsilicon verification.
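
The seven-orders-of-magnitude figure follows directly from the two rates quoted above. The short Python calculation below is only a worked check of that arithmetic; it assumes the upper bound of 10 Hz for behavioral simulation and 100 MHz for silicon, and the function name is illustrative.

```python
import math


def orders_of_magnitude(slow_hz: float, fast_hz: float) -> float:
    """Base-10 logarithm of the ratio between two cycle rates."""
    return math.log10(fast_hz / slow_hz)


SIMULATION_HZ = 10.0   # assumed: upper bound quoted for behavioral models
SILICON_HZ = 100e6     # assumed: 100-MHz silicon

gap = orders_of_magnitude(SIMULATION_HZ, SILICON_HZ)

# Time a 10-Hz simulation needs to cover one second of 100-MHz execution.
seconds_needed = SILICON_HZ / SIMULATION_HZ
days_needed = seconds_needed / 86400.0

print(f"throughput gap: {gap:.1f} orders of magnitude")
print(f"one silicon-second takes about {days_needed:.0f} days to simulate")
```

Under these assumptions, a single second of execution on silicon corresponds to roughly 116 days of behavioral simulation, which is why long-running random and application tests are practical only after silicon is available.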

One of the goals of presilicon testing is to ensure that the
simulation model matches the behavior specified by the
design. We carried this goal into postsilicon testing and ran a
suite of tests on actual chips in a computer system. The
tests behaved the same when they were run in the computer
system as they did on the PA 7100LC presilicon models.

We knew that postsilicon testing would be the last opportu- 
nity to find functional problems with our processor before 
we shipped systems to customers. Since the cost of finding a 
serious functional problem once systems are shipped is ex- 
tremely high, we wanted to exercise the processor 
thoroughly with as many different tests as possible. The 
variety of features that we had added to the PA 7100LC
made this process more difficult. Each of these features had 
to be tested, usually in combination with other features. 

The tests that we used during the PA 7100LC postsilicon 
verification effort included: 

• A collection of handwritten tests, run in an environment
that made them more stressful for the processor

• Random code generators that produced software that
deliberately stressed complex areas of the processor

• A collection of application software including operating 
systems, benchmarks, and other applications. 



Handwritten Tests. Hewlett-Packard has created a library of
programs whose purpose is to ensure that a processor
conforms to the PA-RISC architecture. In addition to this library,
we created other programs to test specific processor
features. We also created a small operating system that allowed
many of these programs to run simultaneously and
repetitively in a manner that was stressful to the processor. This
operating system would interrupt the programs at different
intervals and also change portions of the processor state
(e.g., cache and TLB) before restarting a program. Finally, the
operating system kept an extensive log of program activity
to help us track down bugs that it found.

In addition to the programs that we ran under the special
operating conditions, we created another set of handwritten
tests specifically to test the memory and I/O controller
portion of the processor. These tests used an I/O exerciser card
to ensure that the memory and I/O controller would behave
properly in the presence of any conceivable I/O transaction.
We also used these tests to exercise the DRAM interface of
the memory and I/O controller.

Focused Random Testing. To supplement the handwritten
tests we developed two random code generators.
Experience gained during past processor designs had taught us
that a certain class of bugs appear only when a number of
complex interactions occur within the CPU. It wasn't
feasible to create handwritten tests to cover all of these
interactions because the time requirements to do so would be
prohibitive. Additionally, some of the tests would need to cross
so many interactions that it would be difficult to guarantee
adequate coverage with handwritten cases. Using a random
code approach, we used code generators to create the test
cases that found bugs in this class.

Another strength of the random code approach was that we 
were able to take full advantage of the speed of postsilicon 
testing. We could run all handwritten tests in a short time on 
an actual processor. Random code generators made it pos- 
sible to generate millions of different tests to keep the pro- 
cessor fully exercised, at speed, for long periods of time. 

One could create many conceivable random code
generators, which could differ in many ways including the type of
code produced, fault latency, ease of debugging,
repeatability, and initialization. Design differences in random code
generators cause coverage differences (one generator may
be able to find a bug that another missed). Random code
generators mainly differ in the sequence of instructions and
in what constitutes initial and final processor state. In
general it is best to run code from as many different sources as
possible to ensure the best coverage.

Of the two random code generators that we developed, one
stressed the floating-point unit and another stressed the
integer unit. Each of these generators produced tests
consisting of:

• An initial processor state 

• A sequence of PA-RISC instructions 

• An expected final processor state. 
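
The three-part test structure listed above can be sketched in Python. This is a minimal illustration, not the generators used for the PA 7100LC: the three-operation register machine is a made-up toy rather than PA-RISC, and every name (RandomTest, generate_test, reference_execute, run_on_device) is hypothetical. The expected final state comes from a reference model, and a seed makes each test repeatable.

```python
import random
from typing import Callable, NamedTuple, Tuple

MASK = 0xFFFFFFFF  # toy machine uses 32-bit registers

# Toy instruction set (assumption for illustration only).
OPS = {
    "add": lambda a, b: (a + b) & MASK,
    "sub": lambda a, b: (a - b) & MASK,
    "xor": lambda a, b: a ^ b,
}

Instr = Tuple[str, int, int, int]  # (operation, dest, source1, source2)
State = Tuple[int, ...]


class RandomTest(NamedTuple):
    initial: State                # initial processor state
    program: Tuple[Instr, ...]    # sequence of instructions
    expected: State               # expected final processor state


def reference_execute(initial: State, program: Tuple[Instr, ...]) -> State:
    """Reference model that defines the expected final state."""
    regs = list(initial)
    for op, rd, ra, rb in program:
        regs[rd] = OPS[op](regs[ra], regs[rb])
    return tuple(regs)


def generate_test(seed: int, num_regs: int = 4, length: int = 16) -> RandomTest:
    """Build one repeatable random test from a seed."""
    rng = random.Random(seed)
    initial = tuple(rng.getrandbits(32) for _ in range(num_regs))
    program = tuple(
        (rng.choice(sorted(OPS)), rng.randrange(num_regs),
         rng.randrange(num_regs), rng.randrange(num_regs))
        for _ in range(length)
    )
    return RandomTest(initial, program, reference_execute(initial, program))


def run_on_device(test: RandomTest,
                  device: Callable[[State, Tuple[Instr, ...]], State]) -> bool:
    """Return True when the device's final state matches the expectation."""
    return device(test.initial, test.program) == test.expected
```

In use, a loop over many seeds would call generate_test and run_on_device against the unit under test; keeping the seed of a failing test is what allows the failure to be reproduced during debugging.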

The focused random approach worked extremely well during
the PA 7100LC verification effort. Using it, we were able to
complete thousands of machine-hours of testing and identify
a majority of postsilicon bugs.

Our decision to emphasize random code testing paid off. Be- 
cause of the proven effectiveness of the random approach, 
we will probably continue in this direction and make evolu- 
tionary changes to make the approach even more effective. 

Application Software. In addition to handwritten and random
tests, we ran a variety of "real-world" software applications
to further ensure that we had found and fixed all bugs.
These applications were intended to help diagnose failures 
suspected to be caused by the hardware. We booted operat- 
ing systems (like HP-UX) shortly after chips were available. 
We also conducted long-term operating system reliability 
tests when more stable hardware and software became 
available. We filled out our array of application software 
tests with benchmark suites and other applications. 

Acceptance Criteria. A challenging question that engineers 
and managers face during any postsilicon verification effort 
is "When are we done?" Having clear criteria for the quality 
required to ship the chip to customers is paramount. For the 
PA 7100LC, we used the following acceptance criteria: 

• All failures are diagnosed to root cause.

• No chip failures exist.

• All handwritten code works.

• Random code generators have run for a long time without
finding any failures.

• Application software has run without any indication of
hardware bugs.

In-Circuit Emulation

In addition to constantly tuning existing design and verifica- 
tion methodologies in areas where high-impact productivity 
gains are essential to stay on the leading edge of the industry, 
we also look for new breakthrough technologies and areas 
for paradigm shifts. We considered in-circuit emulation as 
such an area for the PA 7100LC. 

In-circuit emulation means that a chip is modeled at the gate
level in field programmable gate arrays (FPGAs) and con-
nected directly to a chip socket in a real system running at a 
reduced frequency. This allows the modeled chip to run real 
system-level software. 

Continual increases in chip complexity must be countered 
with more effective verification to ensure high-quality first- 
silicon chips. The goal is to have a perfect chip, but the re- 
quirement is to prevent masking bugs. A masking bug is a 
serious bug that causes a class of chip functionality to fail. 
The verification team is unable to "see behind" the bug to 
test for other failures in that area of functionality. The chip 
must be redesigned to fix the masking bug and must pass 
through fabrication before this functionality can be tested. 
Emulation was viewed as a way to prevent these serious 
masking bugs. 

Besides ensuring high-quality first silicon, it is also desirable 
to have enough presilicon simulation throughput to verify 
any proposed postsilicon bug fix. Since turning a chip is 
costly and time-consuming, incorrect bug fixes that cause 
additional bugs must be eliminated. 



During the early phases of the PA 7100LC chip design effort,
in-circuit emulation technology came of age and was
available through external vendors. We investigated this new
technology in depth. For us, in-circuit emulation was viewed
as a paradigm shift in verification and very attractive because
it would:

• Provide near "real hardware" throughput with a presilicon 
model 

• Allow thorough regression of any mask or full-chip turns
necessitated by bugs or timing paths found during postsilicon
verification

• Allow the firmware and software teams to test their code
before real hardware was available

• Add another important debugging capability to our suite of
debug tools that allow us to isolate postsilicon bugs

• Allow us to recreate real hardware failures on a presilicon 
model and allow visibility to all internal nodes of the chip. 

We also saw some areas of concern in pursuing in-circuit 
emulation. We perceived in-circuit emulation as challenging 
and risky because it was a new technology within a very 
young industry. We lacked expertise in using emulation 
tools, and it would be expensive to gain the necessary
expertise to make in-circuit emulation part of our chip design
methodology. In addition to this, the emulation tools and 
hardware were very expensive. 

Our concern with technology risk was eased by several fac- 
tors. We were promised very strong (on-site) support from 
the emulation company that we chose. They assured us that 
tools capable of handling large designs would be available 
early in our design cycle. We had independent corroboration 
from other HP entities, who had seen great success with
emulation in ASIC design efforts. 

After weighing the potential advantages, risks, and our long- 
term needs we determined to pursue in-circuit emulation. 
We didn't believe that emulation was absolutely critical to 
our success on the PA 7100LC, but we felt that dramatic
improvement in simulation throughput would be required to 
verify the increasing complexity of our next-generation pro- 
cessor design. This effort was simply the first step in a long- 
term strategic direction. 

Emulation Methodology 

The real goal of our emulation effort was to plug the
emulation model into the physical system and run at frequencies
near 1 MHz. The team modified an HP 9000 Series 700
workstation to provide the required boot ROM, disk, and I/O
subsystem. A special processor board was designed that allowed
the emulation system to plug into the CPU socket. This board
also provided external cache (SRAM) and main memory
(DRAM). One challenge was to keep the DRAM refreshed
since the processor wasn't running fast enough to keep
memory refreshed and make forward progress on the code stream
at the same time. We implemented a solution that coalesced
the processor memory transactions between refresh cycles
provided at a constant frequency by a module external to
the CPU. This made refresh transparent to the PA 7100LC
emulation model. Fig. 3 shows our emulation setup.

Along with these physical challenges, we also addressed
modeling issues. The emulation company provided an
on-site, experienced engineer to join our emulation team. The
preliminary goal was to take a substantial top-level block
netlist and prove that our style of custom design would
emulate successfully. We chose a block that contained many
unique and difficult-to-model elements. It contained custom
data path blocks and some control blocks, and included
some large regular arrays such as register stacks, TLB, and
internal cache. Because of their size and regular structure,
we chose to model the cache, register stacks, and TLB on
external component boards using TTL parts and PALs. We
turned to industry tools to translate our library of custom
cells into emulation gates, but quickly found that the tools
were incapable of generating accurate gate-level models. We
were forced to create handwritten translations for the
entire library to make progress.

Once we had completed this initial block, we ran the model
in cosimulation mode with a Verilog simulator. The
emulation hardware modeled our target block, while the Verilog
simulator modeled the rest of the PA 7100LC. The models
exchanged stable input and output values after every CPU
clock transition. This approach allowed turn-on and testing
of the external component boards as well as flushing out of 
modeling issues. 
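
The lock-step exchange described above can be sketched as follows. This Python fragment is only a conceptual illustration under simplifying assumptions: two software objects stand in for the emulation hardware and the Verilog simulator, the "block" is a toy 8-bit accumulator, and the class and function names (EmulatedBlock, SimulatedRest, cosimulate) are hypothetical rather than any vendor interface.

```python
from typing import Dict, List

Signals = Dict[str, int]


class EmulatedBlock:
    """Stands in for the target block running on emulation hardware."""

    def __init__(self) -> None:
        self.acc = 0

    def outputs(self) -> Signals:
        return {"acc": self.acc}

    def on_clock(self, inputs: Signals, clk: int) -> None:
        if clk == 1:  # toy behavior: accumulate on the rising transition
            self.acc = (self.acc + inputs["operand"]) & 0xFF


class SimulatedRest:
    """Stands in for the rest of the chip running in a simulator."""

    def __init__(self, operands: List[int]) -> None:
        self.operands = list(operands)
        self.index = 0
        self.seen: List[int] = []

    def outputs(self) -> Signals:
        in_range = self.index < len(self.operands)
        return {"operand": self.operands[self.index] if in_range else 0}

    def on_clock(self, inputs: Signals, clk: int) -> None:
        if clk == 0:  # toy behavior: sample and advance on the falling one
            self.seen.append(inputs["acc"])
            self.index += 1


def cosimulate(block: EmulatedBlock, rest: SimulatedRest,
               cycles: int) -> List[int]:
    """Advance both models together, one clock transition at a time."""
    for _ in range(cycles):
        for clk in (1, 0):
            # Exchange the values that were stable before this transition,
            # then let each side evaluate the transition independently.
            to_block, to_rest = rest.outputs(), block.outputs()
            block.on_clock(to_block, clk)
            rest.on_clock(to_rest, clk)
    return rest.seen
```

The key property is that neither side sees the other's new outputs until the transition has been evaluated on both sides, so the two models stay synchronized even though they run in different engines.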

Next, we attacked the full chip. Our emulation team created
a full chip model, which was partitioned and programmed
into the FPGAs in the emulation boxes. This became a
painful process as we learned that the hardware and software
had never been used on a design of this size, and fatal tool
failures stopped progress many times.

We achieved our first working model that ran through all the
firmware code shortly after the PA 7100LC chip achieved tape
release. We debugged all firmware code before first silicon
arrived from fabrication. This made silicon turn-on much
faster than would have been possible otherwise. We resolved
some nagging emulation failure modes in the
difficult-to-model floating-point circuits within one month of receiving
the first silicon chips. This emulation model allowed
extensive testing on the final chip specification before the masks
were released to fabrication. Only one hardware bug was
found using emulation.

From our emulation efforts we learned the following: 

• Our method of custom VLSI design was difficult to model in 
emulation gates. Many unanticipated race conditions were 
found which had to be resolved. For example, we allow
races (e.g., between a latch's data signal and its enable
signal) that we can guarantee will be won on the chip.
However, with uncertain delays on these signals within the
FPGAs, these races are easily lost. We also found that
wire-OR logic is very difficult to model.

• We found that electrical characterization was the limiting 
issue for shipping products in volume. Emulation does not 
help this problem directly. Although it does help to prevent
masking bugs, it may not actually shorten the ship-release 
date. 

• Even though custom VLSI chips are much more difficult to 
emulate than ASICs, in-circuit emulation is a viable
technology. As emulation technology matures, the effort required to
model complex CPUs will become more reasonable. Because
of the immaturity of in-circuit emulation technology at the
time we were using it, we were only able to make a minor
contribution to the development of the PA 7100LC with this 
technology. 

The learning curve for emulation technology was steep, but 
this effort can be seen as successful when used as a step- 
ping stone to a new technology paradigm. We identified 
many issues and shortcomings with using current emulation 
technologies to accelerate vector throughput. We can now
continue to move towards either applying more mature 
emulation technology or developing new approaches that 
better address the issues that we identified. 

Postsilicon Electrical Verification 

The goal of postsilicon functional verification is to identify 
failures caused by inappropriate logic within the chip. These 
functional failures generally manifest themselves on every 
chip that we manufacture and will be unrelated to the operating 
point (e.g., temperature, voltage, or frequency) of the 
CPU. 

Electrical failures are another class of failures that we sought 
out during the postsilicon verification effort for the PA 
7100LC. Electrical failures cause the chip to malfunction 
and typically have a root cause in some electrical phenome- 
non such as: 

• Ground or power supply noise on the board or chip 

• Coupling between signals 

• Charge sharing 

• Variation in FET speed or drive capability caused by 
variation in the manufacturing process 

• Leakage related phenomena 

• Race conditions 

• Unforeseen interchip circuit interactions. 

Because the integrated circuit manufacturing process varies 
slightly with time, electrical failures may or may not be present 
on all chips that are produced. Further, certain operating 
conditions will typically exacerbate the failure. Sometimes a 
failure will occur at any operating point and can be difficult 
to distinguish from a functional failure. However, most will 
be dependent upon some parameter of the chip's operating 
point. 

To deal appropriately with failures of this class, we staffed 
an electrical verification effort for the PA 7100LC that was 
mostly independent from its functional verification 
(described earlier). The goals of this effort were to: 

• Identify, isolate to root cause, and repair all failures within 
the operating range possible in customer systems 

• Identify and isolate to root cause any failures within a 
significant, well-defined region of margin outside of this 
operating range. 

The first goal is clearly necessary to provide quality systems 
to customers. We created the second goal with the knowl- 
edge that in some cases, understanding the root cause for 
failures outside of our expected operating range would be 
beneficial. Sometimes this knowledge would enable us to 
make proactive design changes which would increase chip 
yields, resulting in lower chip and system costs. Such knowl- 
edge is also useful when moving the chip into a higher-fre- 
quency range or a new process technology. 

To meet these goals, we instrumented several systems so 
that we could independently control each of the CPU supply 
voltages and the operating frequency of the system. We inter- 
faced each set of controlling instruments to a host computer 
which could systematically vary the operating point parame- 
ters, direct the system under test to run a variety of possible 
tests, and observe and log the results of those tests. We 
placed each system under test in an environmental chamber 
that was capable of varying the temperature from -40°C to 
100°C. In each system under test, we also varied system 
parameters such as memory loading and I/O bus loading. 

In the presence of an electrical failure and the appropriate 
operating conditions, certain code streams will not evaluate 
as expected. To ease the task of isolating electrical failures, 
we created test code specifically for electrical verification 
that stressed the various interfaces and functional units of 



the chip in turn. Each segment of this test code would indi- 
cate its progress as it ran. This allowed us to isolate a failure 
quickly to a particular, very short segment of the test code. 

In addition to this electrical verification code, we leveraged 
the random code generators used by the functional verifica- 
tion team, and ran the code sequences that they produced at 
the corners of the PA 7100LC's operating region. 

Using this data generating and collection system, we were 
able to create graphs that indicated passing and failing code 
sequences as a function of voltage, frequency, temperature, 
system conditions, and IC process. By inspecting the operat- 
ing point dependencies (or lack of dependencies) of a failing 
code stream, we could gain insight into the root cause for a 
failure. To confirm our root cause analyses and potential 
fixes, we created new handwritten test codes, altered exist- 
ing silicon using focused-ion-beam milling, and performed 
electron beam probing of chips in systems. 
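
The pass/fail graphs described above can be pictured with a minimal Python sketch that sweeps supply voltage and frequency and prints a grid of results (a shmoo plot). This is only an illustration: `run_test` is a hypothetical stand-in for directing the system under test, and its failure model, the function names, and the voltage and frequency values are all assumptions, not measurements or tooling from the PA 7100LC effort.

```python
from typing import Callable, Dict, List, Tuple


def run_test(voltage: float, freq_mhz: float) -> bool:
    """Hypothetical stand-in for one test run on the system under test.

    Assumed toy model: the highest passing frequency rises linearly with
    supply voltage, as it might for a speed-path failure.
    """
    max_freq_mhz = 20.0 * voltage  # invented coefficient, for illustration
    return freq_mhz <= max_freq_mhz


def shmoo(voltages: List[float], freqs_mhz: List[float],
          test: Callable[[float, float], bool]) -> Dict[Tuple[float, float], bool]:
    """Run the test at every (voltage, frequency) operating point."""
    return {(v, f): test(v, f) for v in voltages for f in freqs_mhz}


def render(results: Dict[Tuple[float, float], bool],
           voltages: List[float], freqs_mhz: List[float]) -> str:
    """Draw one row per voltage (highest first); 'P' = pass, '.' = fail."""
    lines = []
    for v in sorted(voltages, reverse=True):
        row = "".join("P" if results[(v, f)] else "." for f in freqs_mhz)
        lines.append(f"{v:4.2f}V {row}")
    return "\n".join(lines)


VOLTAGES = [4.5, 5.0, 5.5]
FREQS_MHZ = [80.0, 90.0, 100.0, 110.0]
RESULTS = shmoo(VOLTAGES, FREQS_MHZ, run_test)
print(render(RESULTS, VOLTAGES, FREQS_MHZ))
```

In a grid like this, a pass/fail boundary that moves with voltage or frequency hints at an operating-point-dependent (electrical) failure, while a test that fails everywhere looks more like a functional failure.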

The PA 7100LC's postsilicon electrical verification effort 
ensured that the chip would perform well in a wide range of 
electrical environments. It identified easily repaired yield 
limiters that allowed us to maximize yield and minimize the 
cost of the CPU. Each of these successes allowed our system 
partners and customers to be more successful in meeting 
their goals. 

Debug and Test 

Since the PA 7100LC processor was designed to be the core 
component of a low-cost workstation line, the factory cost 
goals and expected volumes clearly indicated that careful 
attention to ease of test and manufacturability was necessary. 
The following test features were defined based upon design 
and manufacturing needs: 

• Parallel test vector capability in excess of 100 MHz 

• IEEE Standard 1149.1-compatible boundary scan interface 

• On-chip clock gating circuitry 

• Retention of internal state when the chip clocks are halted 

• Internal scan with single and double clock step capability 

• Fully static operation to support off-chip Iddq testing 

• Signature analysis capability for testing the on-chip 
instruction buffer 

• At-speed capture of internal states by scan registers. 

To meet manufacturing cost goals, the PA 7100LC had 
aggressive quality and test time goals compared with our 
previous processor designs. Both of these items significantly 
affect final chip cost. A test methodology was developed 
early in the design phase to facilitate the achievement of 
these goals. The methodology encompassed chip test and 
characterization needs and manufacturing test needs. 

Testing is accomplished through a mixture of parallel and 
scan methods using an HP 82000 semiconductor test system. 
The majority of testing is done with at-speed parallel pin 
tests. Tests written in PA-RISC assembly code cover logical 
functionality and speed paths and are converted through a 
simulation extraction process into tester vectors. Scan- 
based block tests are used for circuits such as standard-cell 
control blocks and the on-chip instruction buffer which are 
inherently difficult to test fully using parallel pin tests. Iddq 
measurements are also performed after some parallel tests 



32 April 1995 Hewlett-Packard Journal 

©Copr. 1949-1998 Hewlett-Packard Co. 



[Figure 4 schematic: I/O driver showing the boundary scan circuitry 
(boundary scan drive, boundary scan data, scan in), tristate enable, 
drive clock, pad data, pad driver with large driver FETs, small 
holder FETs, inverters, and pad.] 

Fig. 4. Simplified diagram of a PA 7100LC I/O driver. Static current can flow from VDD to ground in the inverters if the pad is not driven 
to VDD or ground. For example, if the pad driver drives a one, the pad would be driven to 3.3V (VDL), which would cause static current to 
flow, invalidating the Iddq test. For Iddq measurements, the pad is driven to 0V (ground) through the boundary scan circuitry and pad 
driver. 



to provide additional defect coverage. The parallel test sequence 
is 600,000 states long, and 42 Mbits of scan vectors 
are used during scan testing. 

To meet our test quality and cost goals, we implemented two 
new chip-test techniques that had not been used on previous 
PA-RISC implementations: Iddq testing and sample-on-the-fly 
testing. 

Iddq Implementation 

Iddq testing is a test methodology in which the presence of 
defects is detected by measuring dc current when the chip is 
halted. Nondefective full CMOS gates draw static current 
made up of leakage currents that are in the nA range. However, 
defective gates can draw currents many orders of magnitude 
higher. If a current measurement is made on the 
power supply during a static state, a good chip will draw 
very little current and a defective chip will draw much more. 
Iddq has high observability and detects many different types 
of defects. It was decided early in the design of the CPU that 
Iddq test capability would be a desirable test feature. Iddq 
test capability was also desirable because it substantially 
reduces static power consumption. 

Design Rules. To support Iddq testing, most of the circuits 
leveraged from past PA-RISC implementations that drew dc 
current were eliminated. For each case in which using a 
circuit that drew static current was the only reasonable design 
solution, the circuitry was redesigned to be disabled 
with a test signal during Iddq measurements. Most blocks 
containing pseudo-NMOS circuitry were redesigned using 
static CMOS circuitry. Dynamic circuits were modified to 
eliminate static current and to retain state while the chip is 
halted. No FET gate is allowed to be in a situation where it 
could float if the clocks are halted because this could possibly 
cause the FET to turn on. Internal pullups on input pins 
are disabled during Iddq measurements, including the IEEE 
1149.1 test pins. No drive fights are allowed in a static state. 
All nodes make a full transition to a supply rail, which is 
accomplished through the use of restorative static feedback 
when full CMOS transfer gates are not used in latches and 
multiplexers. Any bus that could be completely tristated in 
any state uses a bus holder circuit to maintain proper levels. 

Special Considerations. The floating-point ALU, which was 
leveraged from the PA 7100 processor, drew static current 
and redesigning it was not feasible given our schedule constraints. 
However, it is possible to eliminate the static current 
during Iddq measurements if the ALU is not evaluating 
during the measurement. Since Iddq testing was not going 
to be used to test the ALU, this was acceptable. Iddq testing 
during parallel vectors is still possible, but if a floating-point 
operation occurs that uses the ALU, the ALU loses its internal 
state if Iddq test mode is enabled during the test. 

Another area of consideration for Iddq involved the I/O bit 
slices. The CPU uses two power supplies, VDD and VDL, 
which are nominally at 5V and 3.3V respectively. VDD supplies 
all of the internal chip logic, while VDL is the supply for 
the output driver pullup FETs. The input receivers on the 
CPU normally draw static current when an output driver is 
on that drives to VDL. In addition, a circuit to hold the current 
value on the pad can draw static current if the pad is 
not driven to VDD or ground. Therefore, when Iddq measurements 
are taken, the output drivers are driven to ground 
through the use of the boundary scan circuitry to eliminate 
static current flow in the receiver and pad holder circuits 
(see Fig. 4). The parallel tester drives input-only pins to VDD 
or ground as appropriate, including the IEEE 1149.1 interface 
pins. The analog inputs of the clock buffers are also driven 
to appropriate values to prevent static current. 

These rules were easy to adhere to and followed our rationale 
to increase test capability with little design impact. 
Iddq compliance was verified by running functional simulation 
cases through an HP proprietary FET-level switch simulator 
which also has the ability to check for static current 
violations. Because of careful attention to the design guidelines, 
only six Iddq violations were discovered when the 
simulations were run, all of which were easily resolved. 

Iddq Measurement 

Iddq measurements are taken using a parametric measurement 
unit on the HP 82000 tester (see Fig. 5). When a measurement 
is to be taken, a vector sequence is run to place 
the device under test (DUT) into a static state. After the 
dynamic current transients have settled, the measurement 
unit is connected to the chip power plane with a relay, and 
the regular VDD supply is then switched out with relays. The 
parametric measurement unit then supplies and measures 
the current flowing into the DUT. The power plane for the 
DUT is separated from the test fixture power plane by relays 
connected between the chip and the test fixture. Bypass 
capacitors to control supply noise are placed on VDD on the 
power supply side of the relays. This is important because 
leakage currents in large electrolytic capacitors can be tens 
of microamps, which would compromise the accuracy of the 
measurement. 

Typical measurements are in the range of 1 µA. The Iddq 
current is dominated by reverse bias leakage current and 
subthreshold leakage. Measurements are taken during wafer 
and package test, and four measurements are made. Four 
parallel vectors are used, which initialize the registers, 
cache, TLB, and other state logic to all zeros, all ones, and two 
patterns of alternating ones and zeros (to check for bridging 
faults). This provides a great deal of defect coverage while 
incurring minimal test overhead. 
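
The four-measurement screen can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the HP 82000 test program: the fill words, the 10 µA pass limit, and the two toy chip models are invented, and `measure_ua` stands in for placing the chip in a static state and reading the parametric measurement unit.

```python
from typing import Callable, Dict, List, Tuple

# The four static states named in the text; the 32-bit fill words are assumed.
PATTERNS: Dict[str, int] = {
    "all_zeros": 0x00000000,
    "all_ones": 0xFFFFFFFF,
    "alternating_a": 0xAAAAAAAA,
    "alternating_b": 0x55555555,
}

IDDQ_LIMIT_UA = 10.0  # assumed pass limit; typical good readings are near 1 uA


def iddq_screen(measure_ua: Callable[[int], float],
                limit_ua: float = IDDQ_LIMIT_UA
                ) -> Tuple[bool, Dict[str, float], List[str]]:
    """Measure static current for each pattern and flag readings over the limit."""
    readings = {name: measure_ua(word) for name, word in PATTERNS.items()}
    failing = [name for name, ua in readings.items() if ua > limit_ua]
    return (not failing), readings, failing


def good_chip(word: int) -> float:
    """Toy defect-free chip: leakage only."""
    return 1.0


def bridged_chip(word: int) -> float:
    """Toy chip with a bridge between two adjacent nodes (bits 0 and 1).

    The bridge conducts only when the two nodes are driven to opposite values.
    """
    bit0 = word & 1
    bit1 = (word >> 1) & 1
    return 500.0 if bit0 != bit1 else 1.0


print(iddq_screen(good_chip)[0])      # True
print(iddq_screen(bridged_chip)[2])   # ['alternating_a', 'alternating_b']
```

In this toy model the all-zeros and all-ones states drive both bridged nodes to the same level, so only the alternating patterns expose the defect, which is the reason the text gives for including them.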



Iddq testing was very effective at catching defects on the PA 
7100LC. Results indicate that 50% of scan test failures and 
70% of parallel failures are caught by Iddq testing. In addition, 
other types of defects are caught that might not be 
caught by conventional voltage-level testing, like gate oxide 
shorts and some types of bridging faults. These can lead to 
reliability problems over the life of the product, so it is important 
to catch them at the chip test stage. 

We plan to do more directed Iddq testing on future chips, 
using scan testing and parallel testing to set up and measure 
current for specific chip states indicated by automatic test 
generation tools. This should improve the level of coverage 
we get for Iddq tests. However, one problem that may occur 
is that off-FET leakage will increase in the effort to improve 
FET performance in future IC processes. This has a direct 
effect on the ability of Iddq techniques to resolve low current 
defects. Additional techniques like power supply partitioning 
may be necessary to make Iddq usable with more 
advanced IC processes. 

Sample-on-the-Fly Testing 

An interesting new feature that is implemented on the CPU 
enables scan registers to capture the internal state of the 
chip while the chip is operating at speed in a normal system. 
We refer to this capability as sample-on-the-fly testing. The 
sample is nondestructive, and the data can be accessed 
while the chip continues to execute code by scanning the 
results out using the on-chip IEEE 1149.1-compatible test 
access port (TAP). This feature was very useful for debugging 
and characterizing system-level performance because it is 
essentially a logic analyzer built directly into the chip which 
allows access to over 4000 internal state values. Samples can 
be taken with any IEEE 1149.1-compatible test controller 
and appropriate software. 

Internal Sampling. The internal sampling capability allows a 
sample to occur when the architected PA-RISC interval 
timer reaches a count that matches a preset value in a register 
and the TAP circuitry is in a specific state. In the PA 
7100LC the interval timer on the chip is a 32-bit register that 
increments by one for every clock cycle that occurs on the 
chip. An additional 32-bit register provides a value to compare 
with the value in the interval timer register. This value 
can be set by doing a PA-RISC mtctl (move to control register)† 
instruction. When the interval timer value matches the 
value set by the mtctl instruction, a comparator circuit generates 
a signal which is normally sent to the control logic to 
cause an interval timer interrupt to occur. This signal is also 
sent to the TAP in this implementation. If the current TAP 
instruction is ISAMPLE, the state of the chip is sampled into 
each scan register on the following chip state by allowing 
each scan register to update during the phase when the 
functional latch is not being updated. An indication that a 
sample has occurred is sent from one of the test pins when 
the sample is taken. The pin can be monitored by an external 
IEEE 1149.1-compatible controller system to determine 
when data can be shifted out of the chip. The shifting of the 
sampled data does not corrupt the state of the internal logic. 



† This instruction moves data to a control register. In this instance it is moving data to the 
timer comparison register. 

Find the failing code sequence in the system 
through shmooing or functional tests. 

Insert mtctl instruction into the failing code 
sequence to cause interval timer to time out at a 
particular clock cycle. 

Arm trigger in the test circuitry through the test 
access port (TAP) interface. 

Run the modified code sequence, causing the 
interval timer to trigger the test access port to 
take a one-state snapshot of the chip logic. 

Scan the sampled values out of the TAP pins 
serially while the chip continues to run 
uninterrupted. 

Compare the snapshot taken above to values 
from the simulation or from a known good snapshot 
taken at the same point with another chip. 

Sample subsequent state. 

Debug the failure to the failing circuit by 
examining the differences between the known 
good sample and the sample just taken. 

Fig. 6. Sample-on-the-fly testing process. 

If another sample is desired, the above procedure is simply 
repeated. Fig. 6 summarizes the sample-on-the-fly process. 
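
The triggering behavior described under Internal Sampling can be modeled in a few lines of Python. This is a behavioral sketch only: the class and method names are hypothetical, the chip state is reduced to a single integer with an invented update rule, and scan shifting through the TAP is not modeled.

```python
MASK32 = 0xFFFFFFFF  # the interval timer and compare register are 32 bits wide


class SampleOnTheFlyModel:
    """Toy model of a timer-triggered, nondestructive state capture."""

    def __init__(self) -> None:
        self.interval_timer = 0     # increments once per clock
        self.compare = None         # loaded by mtctl on the real chip
        self.armed = False          # True while the TAP holds the sample instruction
        self.state = 0              # hypothetical stand-in for internal chip state
        self.scan_registers = None  # captured snapshot, if any
        self.sample_taken = False   # models the indication sent on a test pin

    def mtctl_compare(self, value: int) -> None:
        """Stand-in for the mtctl that loads the timer comparison register."""
        self.compare = value & MASK32

    def arm(self) -> None:
        """Stand-in for loading the sample instruction through the TAP."""
        self.armed = True
        self.sample_taken = False

    def clock(self) -> None:
        """Advance one clock; capture the following state on a timer match."""
        match = self.armed and self.interval_timer == self.compare
        # Functional logic advances regardless of sampling (invented rule).
        self.state = (self.state * 3 + 1) & MASK32
        self.interval_timer = (self.interval_timer + 1) & MASK32
        if match:
            # Copy the state that follows the match; the functional state
            # itself is left untouched.
            self.scan_registers = self.state
            self.sample_taken = True
            self.armed = False


CHIP = SampleOnTheFlyModel()
CHIP.mtctl_compare(3)
CHIP.arm()
for _ in range(6):
    CHIP.clock()
print(CHIP.scan_registers, CHIP.state)  # 40 364
```

The snapshot holds the value from the state after the match while the functional state keeps advancing, mirroring the nondestructive sampling described in the text.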

Results. Although sample-on-the-fly testing capability required 
careful electrical and timing design, it has proven to be very 
effective for debugging. It was vital at system frequencies 
approaching 100 MHz, since our traditional external debugging 
hardware was unable to function at this frequency because 
of electrical constraints. Sample-on-the-fly testing 
became our only debugging tool in systems with high-frequency 
critical paths. It was used several dozen times in 
high-speed characterization and led to the resolution of several 
slow timing paths. It is clear that as CPU frequencies 
increase, more debugging circuitry will need to be included 
directly on the chip to assist in diagnosing functionality, 
speed, and electrical failures. 

Debug Mode 

The sample-on-the-fly technique allowed us to observe the 
values present at many nodes, at one very specific point in 
time, and at any operating frequency. Since this test technique 
uses the test access port to observe these values, it 
provides information about the chip state at a relatively low 
bandwidth. This information is an extremely valuable diagnosis 
tool for designers because it enables them to know 
exactly when a problem is occurring. 



Sometimes, especially when a problem is not yet fully understood, 
a higher-bandwidth path to diagnostic information is 
useful to designers. To allow designers access to larger 
amounts of information across broad slices of time, we 
added a debug mode to the PA 7100LC. This mode makes 
available externally the values of several key internal buses 
and control interfaces, on a state-by-state basis. 

Software can place the chip in the debug mode by executing 
a series of CPU diagnostic instructions. Software can also 
be used to choose a set of signals to be made externally visible. 
These signal sets were carefully chosen by the chip's 
designers as being indicative of the internal state of the CPU. 
Examples of signal sets that can be made visible using the 
debug mode include: 

• Internal instruction and data buses 

• CPU to memory and I/O controller interface 

• Key cache controller state information. 

When the chip is operating in the debug mode, it identifies 
unused cycles on the I/O bus and uses them to drive the selected 
debug information onto the I/O bus. The debug circuitry 
can be programmed by software either to throw away 
debug data during states when the I/O bus is unavailable, or 
to cause the CPU pipeline to stall during these states so that 
no debug information is lost. 
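
The trade-off between the two programmable behaviors can be illustrated with a small Python model. It is a sketch under simplifying assumptions: one debug word per CPU state, one I/O bus cycle per step, and a hypothetical function name and bus schedule.

```python
from typing import List, Tuple


def run_debug_mode(io_busy: List[bool], n_states: int,
                   stall: bool) -> Tuple[List[int], int]:
    """Return (CPU states whose debug word was driven, bus cycles used).

    io_busy[i] is True when the I/O bus is unavailable in cycle i.
    With stall=False a busy cycle discards that state's debug word;
    with stall=True the pipeline waits, so nothing is lost.
    """
    driven: List[int] = []
    state = 0
    cycles = 0
    for busy in io_busy:
        if state == n_states:
            break
        cycles += 1
        if not busy:
            driven.append(state)  # unused bus cycle: drive the debug word
            state += 1
        elif not stall:
            state += 1            # CPU moves on; this debug word is dropped
        # else: pipeline stalls and the same state is retried next cycle
    return driven, cycles


BUS = [False, True, True, False, False, False, False, False]
print(run_debug_mode(BUS, 4, stall=False))  # ([0, 3], 4)
print(run_debug_mode(BUS, 4, stall=True))   # ([0, 1, 2, 3], 6)
```

Discarding keeps the CPU running at full speed but leaves gaps in the trace; stalling gives a complete trace at the cost of extra cycles, which changes the timing of the code being observed.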

Externally driving debug information allows engineers to see 
a sufficient amount of state information on a large enough 
number of CPU states to be able to quickly direct further 
efforts at locating postsilicon problems. 

Both debug mode and sample-on-the-fly turned out to be invaluable 
debugging aids in the highly integrated environment 
of the PA 7100LC. 

Conclusion 

Supporting design methodologies allow implementation of 
the features that a product requires to meet its design goals. 
The methodologies used to synthesize, place and route, simulate, 
verify, and test the PA 7100LC processor were crucial 
to the processor's success. 

An I/O System on a Chip 



The heart of the I/O subsystem for the HP 9000 Model 712 workstation is 
a custom VLSI chip that is optimized to minimize the manufacturing cost 
of the system while maintaining functional compatibility and comparable 
performance with existing members of the Series 700 family. 

by Thomas V. Spencer, Frank J. Lettang, Curtis R. McAllister, Anthony L. Riccio, Joseph F. Orth, and 
Brian K. Arnold 



The HP 9000 Model 712 design is based on three custom 
pieces of VLSI that provide much of the system's functionality: 
CPU, graphics, and I/O. These chips communicate via a 
high-performance local bus referred to as GSC (general system 
connect). This paper will focus primarily on the I/O chip. 

A major goal of the Model 712 I/O subsystem was to provide 
a superset of the I/O performance and functionality available 
from other family members at a significantly reduced 
manufacturing cost. This goal was bounded by the reality of 
a finite amount of engineering resources, and it was obvious 
from the start that integrating several of the I/O functions 
onto a single piece of silicon could greatly reduce the total 
I/O subsystem manufacturing cost. Each function of the I/O 
subsystem was examined individually as a candidate for 
integration. The value of maintaining exact driver-level software 
compatibility was also evaluated with respect to the 
advantages of minimizing the hardware cost for each of the 
I/O functions. 

The investigation indicated that the optimal solution for the 
Model 712 was an I/O subsystem that centered around a 
single piece of custom VLSI. The chip that resulted from this 
investigation directly implements many of the required I/O 
functions and provides a glueless interface between the GSC 
bus and other common industry I/O devices. This chip was 
named LASI, which is an acronym that refers to the two 
major pieces of functionality in the chip, LAN and SCSI. 
The LASI chip also provides several miscellaneous system 
functions that further reduce the amount of discrete logic 
required in the system. 

Chip Overview 

The LASI chip was designed in a 0.8-µm CMOS process and 
is 13.2 mm by 12.0 mm in size (including I/O pads). It contains 
520,000 FETs and is packaged in a 240-pin MQUAD 
package. LASI dissipates approximately three watts when 
operating at the maximum GSC frequency (40 MHz). LASI 
was designed primarily using standard-cell design methodologies 
although several areas required full custom design. 

A functional block diagram of LASI is shown in Fig. 1. The 
majority of circuitry in LASI is consumed by only two functions, 
LAN and SCSI. Both of these designs were purchased 
from outside companies and ported to HP's design process. 
The SCSI functionality is exactly identical to the NCR 
53C710 SCSI controller, and the LAN functionality is exactly 
identical to an Intel 82C596 LAN controller. 



Other I/O functionality that is completely implemented on 
LASI with HP internal designs includes: RS-232, Centronics 
parallel interface, a battery-backed real-time clock, and two 
PS/2-style keyboard and mouse ports. In addition, LASI provides 
a very simple way of connecting the WD37C65C flexible 
disk controller chip to the GSC bus. The system boot ROMs 
are also directly controlled by the LASI chip. The Model 712 
provides 16-bit CD-quality audio and optionally supports 
two telephone lines. LASI provides the GSC interface and 
clock generation (using digital phase-locked loops) for both 
of these audio functions. Fig. 2 shows an approximate floor 
plan of the LASI chip. The floor plan shows the general layout 
and relative size of each block. 

LASI contains several system functions that help to minimize 
the miscellaneous logic required in the system. This includes 
GSC arbitration and reset control. LASI also serves as the 
GSC interrupt controller. 

It is possible to use up to four LASI chips on the same GSC 
bus. LASI can be programmed at reset to reside in one of 
four different address locations. The arbitration circuit supports 
chaining, and LASI can be programmed to either drive 
or receive reset. 

System Support Blocks 

The following sections give a brief overview of each of 
LASI's major functional blocks that provide system support 
functionality in the Model 712, but do not directly support or 
implement any I/O function. 

GSC Interface. The GSC (general system connect) bus connects 
the major VLSI components in the Model 712. It is a 
32-bit bus with multiplexed address and data. The bus consists 
of 47 signals for devices capable of being a bus master. 
The GSC bus is defined to run at up to 40 MHz, giving a peak 
transfer rate of 160 Mbytes/s. 
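
The quoted peak rate follows directly from the bus width and clock. A quick check in Python, assuming one 32-bit transfer per clock and 1 Mbyte = 10^6 bytes (the function and constant names are illustrative only):

```python
BUS_WIDTH_BITS = 32        # GSC data width
MAX_CLOCK_HZ = 40_000_000  # maximum GSC frequency


def peak_mbytes_per_s(width_bits: int, clock_hz: int) -> float:
    """Peak rate assuming one full-width data transfer on every clock."""
    bytes_per_transfer = width_bits // 8
    return bytes_per_transfer * clock_hz / 1_000_000


print(peak_mbytes_per_s(BUS_WIDTH_BITS, MAX_CLOCK_HZ))  # 160.0
```

Because address and data share the same lines, cycles that carry an address carry no data, so sustained throughput is lower than this peak figure.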

The GSC interface block in LASI provides the connectivity 
between the GSC bus and the wide variety of internal bus 
blocks, many of which have different logical and timing requirements. 
This block converts the GSC bus to a less complex 
internal LASI bus. The LASI internal bus is very similar 
to the GSC bus, but it is not as heavily multiplexed and is 
more flexible than the GSC bus in that it easily accommodates 
the simpler interface for the general-purpose I/O 
blocks in LASI. The GSC interface block handles bus errors 
and keeps track of parity information for other internal 
blocks, removing the associated complexity from these controllers. 
Both master and slave devices reside on the LASI 
internal bus. 

LASI is a slave whenever the CPU initiates data transfer. As 
a slave, LASI supports only subword and word writes, and 
subword, word, and double-word reads.* Internal slave devices 
only need to support a subset of these transactions. 
There are five different protocol behaviors for slave devices 
in LASI: unpaced byte wide, paced byte wide, packed byte 
wide, unpaced word wide, and paced word wide. 

Unpaced devices, such as the real-time clock, don't use a 
handshake with the GSC interface, making their protocol 
very simple. When a device requires a variable length of 
time to transfer data it is called paced. The SCSI interface is 
an example of a paced device. A packed device is one that 
sends a sequence of bytes to make up a word or double 
word. The boot ROM interface is an example of a packed 
device. 
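
As a rough illustration of a packed transfer, the Python sketch below assembles a sequence of bytes from a byte-wide device into one 32-bit word or 64-bit double word. The big-endian order (first byte most significant) is an assumption chosen to match PA-RISC's usual byte ordering; the text does not say how LASI orders packed bytes, and `pack_bytes` is a hypothetical name.

```python
def pack_bytes(data: bytes) -> int:
    """Combine 4 bytes into a word or 8 bytes into a double word.

    Assumes the first byte received is the most significant.
    """
    if len(data) not in (4, 8):
        raise ValueError("a packed transfer must be 4 or 8 bytes long")
    return int.from_bytes(data, "big")


WORD = pack_bytes(bytes([0x12, 0x34, 0x56, 0x78]))
print(hex(WORD))  # 0x12345678
```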

* In PA-RISC a subword is typically one byte, a word is 32 bits, a double word is 64 bits, and a 
quad word is 128 bits. 



A simple strobe signal is asserted while internal data and 
address buses are valid. Internal devices have no direct 
interaction with bus errors. 

As a bus master, LASI is capable of initiating subword, word, 
double-word, and quad-word transactions on the GSC bus. 
Once one of LASI's internal bus masters owns the bus, it can 
signify the start of a transaction by asserting the master_valid 
signal (see Fig. 3). The device must then simultaneously 
drive its DMA address (master_address), transaction type, and 
byte enables onto the bus. On a read, the first available data 
word will appear on the internal bus when the master_acknowledge 
signal is asserted by the GSC interface. The GSC 
interface will not accept another master_valid until all the read 
data has been transferred. 

If a timeout error, address parity error, or data parity error is 
encountered on the GSC bus, the GSC interface will always 
do a normal handshake for the transaction by asserting the 
master_acknowledge signal. The transaction will complete as 
usual except that an error is logged, disabling arbitration for 
the device so it cannot be a bus master again. This means 
that internal masters, at the hardware level, never need to 
respond directly to bus errors. When the GSC interface 
block sees a timeout error it will, from the perspective of its 
internal bus blocks, complete a transaction normally. In this 
way the GSC's error signaling mechanism can correctly terminate 
an errant transaction without adding complexity to 
LASI's internal blocks. 

Parity is generated in the GSC interface whenever LASI 
sources data or an address on the bus. Parity is checked 
whenever LASI is a data sink. LASI does not respond to 
address parity errors on the GSC bus, which result in a 
timeout error. 

Arbitration. LASI contains six different blocks capable of 
initiating a transaction on the GSC bus (see Fig. 1). To initiate 
a transaction, a block must first own (or gain control of) 
the GSC bus. Deciding which potential master owns the bus 
is the job of LASI's arbitration block. The arbitration circuit 
in LASI provides internal bus arbitration for all six internal 
devices and provides external GSC arbitration signals for 
the CPU and an expansion slot. This capability allows LASI 
to function as the central arbiter for the GSC bus in low-end 
systems. The arbitration circuit can also be pin-programmed 
at reset to behave as a secondary arbitration device that is 
controlled by another arbiter. This feature allows LASI to be 
used in larger systems that provide their own arbitration 
circuit. A second LASI can also be used for I/O expansion 
in low-end systems in which the first LASI is providing the 
central arbitration. Support for multiple LASIs on the same 
GSC bus makes the speedy development of multifunction 
I/O expansion boards a relatively simple task. 

The LASI design was simplified by requiring that the LASI 
arbitration circuit gain control of the GSC bus before granting 
the internal bus to potential bus masters. This saved a significant 
amount of complexity in the GSC interface block as well 
as greatly reducing the number of cases that needed to be 
tested during the verification effort. This simplification does 
create a couple of wasted GSC cycles for each transaction 
initiated by LASI. However, this inefficiency has a negligible 
impact on system performance. 

The LASI arbitration circuit provides a simple round-robin 
scheme that provides roughly equal access to all devices. 
The arbitration circuitry keeps track of the identity of the 
last device granted the bus and all currently outstanding 
requests. (A simple truth table makes sure the GSC resource 
is handed out fairly.) If no devices are requesting the bus, 
LASI will default to granting the bus to the CPU. This has a 
small positive impact on performance, given that the CPU is 
the most likely device to initiate the next transaction. This 
arbitration scheme helps simplify the arbitration circuit by 
not requiring it to monitor bus activity. Each bus master is 
responsible for being "well-behaved" with respect to bus 
use. 

The arbitration circuit plays a key role in the error handling 
strategy for LASI. If an error occurs on the GSC bus while 
LASI is the bus master, the arbitration circuit will not grant 
the bus to additional internal devices until the CPU clears the 
error by clearing a bit in the arbitration circuit. This simplifies 
the design of other devices within LASI by not requiring 
them to use the error signal as an input to their state machines. 
When an error is detected, the transaction will terminate 
normally, but no additional transactions will be allowed 
until the situation is rectified by software. 

Interrupt Controller. A total of 13 different interrupt sources 
exist on the LASI chip. Each interrupt source drives a single 
signal to the interrupt controller block. When the interrupt 
signal is asserted, the interrupt controller block will master 
the bus and issue a word write to the I/O external interrupt 
register (IO_EIR), which is physically located in the CPU. The 
data transferred to the IO_EIR contains a value that indicates 
the source of the interrupt. The address of the IO_EIR and the 
interrupt source value can be programmed by writing to the 
interrupt address register located in LASI's interrupt control- 
ler block. Individual interrupt sources can be masked by 
setting bits in the interrupt mask register. 

LASI's interrupt controller is designed to provide a variety of 
interrupt approaches. The Model 712 uses only one of these 
alternatives. Asserting an interrupt causes a write to the 
IO_EIR to be mastered on the GSC bus. Upon receiving an 
interrupt from LASI (via IO_EIR), the CPU will read the inter- 
rupt request register located in LASI's interrupt controller 
block. One bit in the interrupt request register is designated 
for each potential interrupt source in LASI. The interrupt 
request register is cleared automatically after it is read by 
the CPU. 
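
The interrupt flow used by the Model 712 can be modeled roughly as follows. This is a behavioral sketch under stated assumptions: the class and attribute names are invented, bit positions are arbitrary, and the choice that a masked source does not set its request bit is a guess. The 13 sources, mask-by-setting-bits, the write toward IO_EIR, and the read-to-clear request register come from the text.

```python
class InterruptController:
    """Toy model of a 13-source controller with mask and request registers."""

    NUM_SOURCES = 13

    def __init__(self):
        self.mask = 0         # per the text, a set bit masks that source
        self.request = 0      # one pending bit per source
        self.eir_writes = []  # values "written" to IO_EIR, kept for inspection

    def assert_interrupt(self, source, eir_value=1):
        if not 0 <= source < self.NUM_SOURCES:
            raise ValueError("no such interrupt source")
        if self.mask & (1 << source):
            return  # assumption: masked sources never reach the CPU
        self.request |= 1 << source
        self.eir_writes.append(eir_value)  # stands in for the GSC word write

    def read_request_register(self):
        """Return the pending bits and clear them, as a CPU read would."""
        value, self.request = self.request, 0
        return value
```

A driver built on this model would read the request register once per interrupt and then service every bit that is set, since the read discards the pending state.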

Real-Time Clock. The Model 712 needs to keep track of time 
when the system power is off. To this end, LASI provides a 
battery-backed real-time clock. The real-time clock is log- 
ically very simple and consists of a custom oscillator circuit 
and a 32-bit counter that can be read and written to by soft- 
ware. The 32-bit counter is used to keep track of the number 
of seconds that have elapsed from some reference time. 

The oscillator unit operates at 32.768 kHz and typically uses 
less than 10 uA of current when operating on battery backup. 
It uses a minimum of external circuitry (consisting of two 
capacitors, a crystal, and a resistor) to accomplish its task. 

Inside the LASI real-time clock, the 32-kHz signal is reduced 
to a 1-Hz signal by a 15-bit precounter. The 1-Hz signal is 
then used to increment the main 32-bit counter. Both the 
counter and the precounter are implemented using simple 
ripple counters. The 15-bit precounter is always cleared 
when software writes to the 32-bit counter. 
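
The divide-by-32,768 arrangement works because 2^15 = 32,768, so a 15-bit counter rolls over exactly once per second at 32.768 kHz. The sketch below models that behavior in Python. It is a behavioral model only (the hardware uses ripple counters), and the class and method names are assumptions.

```python
class RealTimeClock:
    PRECOUNT_BITS = 15  # 2**15 = 32768 oscillator ticks per second

    def __init__(self):
        self.precounter = 0
        self.seconds = 0  # 32-bit main counter

    def tick(self):
        """Advance by one 32.768-kHz oscillator cycle."""
        self.precounter = (self.precounter + 1) % (1 << self.PRECOUNT_BITS)
        if self.precounter == 0:  # precounter rolled over: one 1-Hz event
            self.seconds = (self.seconds + 1) & 0xFFFFFFFF

    def write_seconds(self, value):
        """Software write: load the main counter and clear the precounter."""
        self.seconds = value & 0xFFFFFFFF
        self.precounter = 0
```

Clearing the precounter on a write means the first second after software sets the time is a full second rather than a fraction of one.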

Phase-Locked Loop Clock Generators. The goal for the LASI 
clock subsystem was to generate all the I/O subsystem clocks 
from one crystal oscillator over a wide range of system fre- 
quencies. The LASI clock block generates five different 
clock frequencies required for the wide variety of I/O inter- 
faces. Three of these clocks are subharmonics of the proces- 
sor clock, and are generated using simple digital state ma- 
chines. However, the 40-MHz clock and the audio sample 
clock are fixed-frequency clocks. The 40-MHz clock is used 
for the SCSI back end and RS-232 baud rate generator, and 
the audio sample clock is used for the external CODEC chip. 
The frequency of this clock (16.9344 MHz to 24.576 MHz) is 
selectable on the fly by the audio and telephone interface. 

Two digital phase-locked loop circuits are provided in LASI 
to generate the two fixed-frequency clocks from the CPU 
clock. These digital phase-locked loops implement the 
equation $f_{clockout} = (f_{clockin} \times N)/(M1 \times M2)$, 
where N, M1, and M2 are digital coefficients stored in LASI 
control registers. $f_{clockin}$ comes from the main system 
reference clock. The $f_{clockout}$ from one of the phase-locked 
loops is used for the audio clock, and the $f_{clockout}$ from the 
other phase-locked loop is used for SCSI, RS-232, and other 
I/O functions. At power-on, the processor initialization code 
(stored in the flash EPROMs) loads the coefficients 
corresponding to the processor clock for the particular 
product. The audio sample clock has two sets of coefficient 
control registers, which are selected by a multiplexer based on 
a signal from the audio interface. Fig. 4 shows one of the 
phase-locked loop circuits. 

Fig. 4. Phase-locked loop clock controllers. 
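
As a worked illustration of the phase-locked loop equation, the following Python sketch evaluates the output frequency for a given set of coefficients. The 60-MHz reference and the specific N, M1, and M2 values are assumptions chosen to make the arithmetic come out to the 40-MHz and 24.576-MHz clocks named in the text. The article does not give the real register values or their widths.

```python
from fractions import Fraction

def pll_output_hz(f_clockin_hz, n, m1, m2):
    """Evaluate f_clockout = (f_clockin * N) / (M1 * M2) exactly."""
    if n <= 0 or m1 <= 0 or m2 <= 0:
        raise ValueError("coefficients must be positive")
    return Fraction(f_clockin_hz) * n / (m1 * m2)

# Hypothetical coefficients for an assumed 60-MHz reference clock.
scsi_clock = pll_output_hz(60_000_000, n=2, m1=3, m2=1)       # 40 MHz
audio_clock = pll_output_hz(60_000_000, n=256, m1=25, m2=25)  # 24.576 MHz
```

Using exact rational arithmetic makes it easy to check whether a candidate coefficient set hits a target frequency exactly or only approximately.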

The phase-locked loop circuits are completely digitally con- 
trolled, including a digitally controlled oscillator, digital phase 
detector, counters, and scan test hardware. This design elimi- 
nates analog control voltages which are susceptible to noise 
and integration errors. The digitally controlled oscillator is a 
ring oscillator with a digitally programmable delay element. 
This design is capable of generating frequencies of up to 135 
MHz. A combination of custom and standard-cell design 
techniques is used in this design. Each phase-locked loop 
cell measures 1500 um by 890 um. 

General I/O Functions 

The blocks shown in Fig. 1 that make up the general I/O 
functions include the parallel port, audio and telephone inter- 
face, RS-232 port, and flexible disk and boot ROM interface. 
These I/O functions originate from HP internal standard-cell 
designs that were originally designed using Verilog RTL 
models and then synthesized into a standard-cell design 
using Synopsys. Some blocks were designed specifically for 
LASI while others were leveraged from previous HP ASIC 
designs. 

Parallel Port. The parallel port is designed to be software 
compatible with previous generations of HP 9000 Series 700 
I/O subsystems while minimizing overall complexity and 
chip area. This port allows interfacing to printers and other 
peripherals supporting the industry-standard Centronics 
parallel interface. The parallel port signals are driven di- 
rectly from the LASI chip without additional buffering. 

DMA was supported on previous workstation controllers 
and therefore needed to be provided on LASI's controller. 
However, since no central DMA controller exists, all DMA 
hardware is contained within the parallel I/O block. Since 
parallel port bandwidth requirements are fairly modest 
(about 400 kbytes/s), DMA is done by reading one 32-bit 
word of data, releasing the bus, transferring one to four 
bytes of data over the interface, and then requesting the bus 
again. This approach keeps the DMA controller quite simple 
while easily accommodating byte unpacking. 

Fig. 5. Address latching logic, and the data and control lines associ- 
ated with the external 8-bit bus. 
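
The parallel port DMA sequence (fetch one word, release the bus, send up to four bytes, repeat) can be sketched as below. The function names are hypothetical, and sending the most significant byte first is an assumption; the text does not state the byte order on the port.

```python
def unpack_word(word, count=4):
    """Split a 32-bit word into its first `count` bytes, most significant first."""
    if not 1 <= count <= 4:
        raise ValueError("count must be 1 to 4")
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)][:count]

def parallel_port_dma(memory_words, total_bytes):
    """Yield bytes for the port, fetching one word per bus tenure."""
    sent = 0
    for word in memory_words:
        if sent >= total_bytes:
            break
        # One word has been read and the bus is released at this point;
        # the bytes below go out over the slow parallel interface.
        chunk = unpack_word(word, min(4, total_bytes - sent))
        for byte in chunk:
            yield byte
        sent += len(chunk)
```

Holding only one word at a time is what keeps the controller small: it needs a single 32-bit holding register and a byte counter rather than a FIFO.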

Keyboard and Mouse Controller. LASI provides support for two 
IBM PS/2-style keyboard and mouse devices, making the 
keyboard and mouse ports just like those used on a standard 
IBM personal computer. These interfaces are new to the 
Series 700 family so there were no software compatibility 
issues, allowing us to optimize the design for low manufac- 
turing cost. The interface provides only a minimal amount of 
hardware and relies on the driver to do most of the work. 
The interface also performs the serial-to-parallel and paral- 
lel-to-serial conversion and does a small amount of buffer- 
ing. An interrupt is generated for every byte of data received 
from the PS/2 device. The software overhead is not a perfor- 
mance issue because of the extremely low data rate of the 
interface. 

Flexible Disk and Boot ROM Interface. LASI supports an exter- 
nal 8-bit bus that provides the capability to connect discrete 
flash EPROM devices and a flexible disk controller with 
very little additional logic. Fig. 5 shows a simple schematic 
of a flash EPROM and the required address latching logic on 
the 8-bit bus. It was not cost-effective to integrate these de- 
vices into the LASI chip. The 8-bit bus is also capable of sup- 
porting other types of 8-bit devices, giving some degree of 
flexibility to the I/O system. 

The 8-bit bus supports 1M bytes of address space (the first 
half of the LASI address space). All transactions to this ad- 
dress space on the 8-bit bus begin with two address cycles. 
These cycles transfer bits 18:3 of the address to two 
74GHT374-type 8-bit latches wired in series and controlled 
by LASI. Multiplexing the address on the data lines saves 15 
pins on LASI. 

LASI is capable of supporting byte, word, and double-word 
reads and byte writes to devices on the 8-bit bus. Word and 
double-word reads are accomplished by doing multiple ac- 
cesses to devices on the 8-bit bus and packing the bytes into 
words before returning them on the GSC bus. Word and 
double-word accesses require the address to be latched only 
once since LASI drives the lower three address bits directly. 
This greatly reduces the word and double-word access time. 
Double-word reads take approximately 75 GSC cycles to 
complete because eight accesses are required on the 8-bit 
bus. During each of the eight accesses a new address is pre- 
sented to the flash EPROM, which results in valid data being 
driven to the 8-bit bus by this flash device. Byte accesses are 
also relatively slow (12 GSC cycles) to support very slow 
devices on the 8-bit bus. It is important to note that the 8-bit 
bus is not electrically connected to the GSC. 
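
The address latching and byte packing described for the 8-bit bus can be sketched as follows. This is an illustration under assumptions: the function names are invented, the order in which the two latch bytes are sent is a guess, and packing the first byte into the most significant position is assumed rather than stated. The split into latched bits 18:3 and three directly driven low bits comes from the text.

```python
def split_address(addr):
    """Return (high_latch_byte, low_latch_byte, low3) for a bus address."""
    latched = (addr >> 3) & 0xFFFF  # bits 18:3, sent in two address cycles
    return latched >> 8, latched & 0xFF, addr & 0x7

def read_packed(device, addr, nbytes):
    """Read 1, 4, or 8 bytes and pack them, first byte most significant."""
    if nbytes not in (1, 4, 8) or addr % nbytes:
        raise ValueError("unsupported or misaligned access")
    base = addr & ~0x7  # the upper bits are latched only once
    value = 0
    for i in range(nbytes):
        low3 = (addr & 0x7) + i  # only the directly driven bits change
        value = (value << 8) | device[base | low3]
    return value
```

For example, with `rom = bytes(range(256))`, a double-word read at address 8 performs eight byte accesses and returns them packed into one 64-bit value, while the two latch bytes are sent just once.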

LASI is designed specifically to support the WD37C65C flex- 
ible disk controller on the 8-bit bus. The Model 712 uses a 
personal computer style flexible disk controller instead of a 
SCSI-based flexible disk controller because of the signifi- 
cantly lower cost of the drive mechanism. The flexible disk 
controller was not integrated into the LASI chip because of 
the low cost of the WD37C65C chip and the potential for 
SCSI drives to come down in cost in the future. The 
WD37C65C shares the data bus and two control lines with 
other devices on the 8-bit bus, but does not consume any of 
the 1M bytes of allocated address space. Supporting the 
WD37C65C requires six dedicated signals and no external 
glue logic. LASI supports the WD37C65C running in DMA 
mode and provides the capability to move data directly be- 
tween main memory and the WD37C65C without processor 
intervention. 

RS-232. The RS-232 block in LASI is an HP internal standard- 
cell design that emulates the behavior of the National Semi- 
conductor NS16550A. The Verilog HDL description for this 
design was leveraged from previous HP ASIC designs used in 
other members of the HP 9000 Series 700 workstation family. 

One difference between this block and the NS16550A is that 
its baud clock is derived from a 40-MHz signal. This allows 
the block to share the phase-locked-loop-generated 40-MHz 
clock with the back end of the SCSI block and eliminates 
the need to support an external crystal or dedicated phase- 
locked loop for baud clock generation. 

Audio Interface. The Model 712 supports built-in CD-quality 
audio and an optional telephony card.1 The telephony card 
is DSP-based and provides simultaneous access to two tele- 
phone lines both capable of supporting voice, fax, or data 
modems. LASI provides the interface between the GSC bus 
and the audio and telephony circuitry. 

An objective for the Model 712 audio subsystem was to 
maintain complete software compatibility with previous dis- 
crete designs. As a result, a good deal of the audio interface 
circuitry on LASI is dedicated to supporting this compatibility 
and is not optimized for minimal manufacturing cost. 

The audio interface in LASI has two DMA channels that sup- 
port the input and output audio streams. Each channel has 
two 4K-byte pages of main memory continually reserved for 
transferring data to and from the CS4215 CODEC. The buff- 
ering in the interface is sufficient to guarantee isochronous 
audio operation, given worst-case GSC bus latencies in the 
Model 712. A wide range of audio formats is supported, in- 
cluding 8-bit or 16-bit words sampled in either linear, µ-law, 
or A-law format at a variety of sample rates from 8 kHz to 48 
kHz.1 The clock that determines the sample rate in the 
CODEC is generated in one of LASI's programmable phase- 
locked loop circuits. Communication between LASI and the 
CODEC is accomplished via a full-duplex, serial bit stream. 
The high-speed serial bus over which LASI communicates 
with the daughter card is similar to a concentrated highway 
bus developed by AT&T but has several modifications. The 
core pinout is the same, using the signals data transmit (DX), 
data receive (DR), and frame synchronization (FS), but the 
definition of the bus has been extended to incorporate 
control of external buffers and bus reset. 

Communication with the telephony card is accomplished via 
two TTY channels internal to LASI. The serial concentrated 
highway bus data is multiplexed onto the high-speed serial 
stream and sent to the CODEC and the telephony card. 
Since TTY devices are used, the driver for the telephone 
system is a highly leveraged version of the existing TTY driv- 
ers. The audio interface and HP Teleshare have a common 
digital interface which resides in LASI. HP Teleshare is de- 
scribed in more detail in the article on page 69. 

Megacell I/O Functions 

LASI contains two megacells whose designs were purchased 
by HP from external vendors. The decision to do this was 
based on maintaining software compatibility with past HP 
9000 Series 700 workstations and the availability of engi- 
neering resources in HP. In both cases, an important goal 
was to maintain the integrity of the megacell as much as 
possible. A definite boundary was drawn between function- 
ality leveraged from external vendors and new design work. 
This boundary proved vital to functional verification and 
production testing. 

LAN Megacell. IEEE 802.3 LAN support is provided by a 
megacell derived from the Intel 82C596 LAN coprocessor. To 
understand the integration, two key areas should be consid- 
ered. First, importing the megacell at the artwork level 
solved some problems and imposed others. Second, in the 
area of interfacing, the integrated megacell eliminated a 
substantial number of chip pins but raised some protocol 
issues that had to be overcome. 

The LAN megacell was imported into our IC design flow at 
the artwork level. Because the original Intel design was 
done in a custom fashion, a netlist translation would have 
required a significantly longer design time and a much larger 
manpower deployment than the artwork translation. Even at 
the artwork level, several modifications were made because 
of differences between the original CMOS process design 
rules and those of our target process. 

One challenge in importing the megacell at the artwork level 
was developing a verification strategy that allowed concur- 
rent simulation of the megacell and the rest of the chip. Be- 
cause the megacell vendor used proprietary simulators run- 
ning in a mainframe environment, the vendor simulation 
models couldn't be used in our Verilog-based environment. 
Hardware modeling was explored, but characteristics of the 
part made this solution impractical. Converting either func- 
tional representations or transistor-based representations to 
Verilog HDL raised too many concerns about modeling accu- 
racy. In view of these roadblocks, an unconventional ap- 
proach to simulation modeling was employed. First, a FET- 
level model was extracted from the artwork. This model was 
turned on and verified using Intel's production test vectors 
and a proprietary in-house simulator. Second, the in-house 
simulator was compiled and linked into the Verilog simulator 
using a procedural-level interface. Third, a Verilog HDL in- 
terface module was written that defined synchronization 
events for data transfer between the two simulators, and the 
model was reverified using production vectors. Finally, tests 
were run that were specifically designed to test the interface 
between the megacell and the internal bus. 

Integrating the LAN megacell did provide a clear win by im- 
proving the ratio of I/O to core area. When sold as a separate 
device, the Intel 82C596 has 89 signal pins devoted to the 
host interface. Once the megacell was integrated, all of 
these signals remained on-chip. In addition, 77 of the re- 
moved signal pins had output drivers, so the associated 
power and ground pins were eliminated. 

The megacell did require a small amount of circuitry to inter- 
face the 82C596 bus to the LASI internal bus. The primary 
difficulty in this area was burst transactions. The system bus 
wanted to know at the start of the transaction how many 
words were to be bursted. In contrast, the 82C596 burst 
protocol would only indicate whether or not it had one more 
word to burst. To minimize complexity and avoid the area 
associated with a FIFO buffer, the decision was made to 
support only two-word bursts. This logical intersection of 
the two bursting protocols provided a bandwidth utilization 
improvement over nonbursted transactions while minimizing 
chip area and development time. 
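
The reasoning behind the two-word limit can be illustrated with a small sketch. The megacell only ever signals that one more word follows, so two is the longest length that can be announced when a transaction starts without buffering. The function below is hypothetical and simply shows how a longer transfer would break down under that rule; the article does not describe how the LASI interface logic actually sequences transactions.

```python
def plan_transactions(word_count):
    """Split a transfer of word_count words into bursts of at most two words."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    plan = []
    remaining = word_count
    while remaining > 0:
        # "One more word follows" is all that is known up front, so the
        # announced length is either 2 (burst) or 1 (single word).
        length = 2 if remaining >= 2 else 1
        plan.append(length)
        remaining -= length
    return plan
```

Under this rule a five-word transfer becomes two 2-word bursts followed by a single-word transaction, roughly halving the number of bus transactions compared with nonbursted operation.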

SCSI Megacell. To provide SCSI-2 support, LASI uses an NCR 
53C710 megacell. This megacell was imported into our design 
methodology as a netlist port. The design was translated from 
NCR's standard-cell library to HP's cell library. A few unique 
components were added to HP's library specifically to sup- 
port the SCSI megacell. While this created some challenges, 
doing a schematic port allowed more flexibility to optimize 
the aspect ratio of the megacell for a more efficient floorplan. 
This technique also masked differences between NCR's pro- 
cess and HP's process. Verilog models for this schematic 
port were simulated in the conventional way. 

The programming and SCSI bus model for the 53C710 
megacell is completely compatible with the industry-stan- 
dard component marketed by NCR. However, the host side 
interface of the megacell is modified to eliminate the pads 
and replace them with standard-cell components. These 
components connect directly to internal megacell signals, 
providing an interface to the chip's internal bus. 

The 53C710 can be a master and a slave device on the GSC 
bus. LASI's internal bus protocol for slave transactions re- 
quires only combinational logic between the megacell and 
the internal bus. As a slave, byte and word transactions are 
supported in the megacell. If SCSI is a bus master, the inter- 
face supports all the transaction types needed by the mega- 
cell, with the help of a small state machine located in the 
SCSI interface block shown in Fig. 2. SCSI data is typically 
transferred using four-word read and write transactions on 
the GSC bus. 

Test Support 

The primary objective for LASI testing was to provide an 
extremely high level of coverage with a limited amount of 
test development resources. Test support was complicated 
by the diverse nature of the circuits on LASI. 

The non-megacell functionality is tested by a combination of 
parallel pin vectors in conjunction with automatically gener- 
ated scan vectors. LASI has an enhanced JTAG (IEEE 
1149.1) test block, 25 distributed internal I/O device scan 
chains, and embedded test functionality in the I/O pads. The 
JTAG test block contains a test access port and boundary- 
scan architecture defined in IEEE standard 1149.1-1990 and 
private instructions used for clock control, full-chip step 
control, and specific scan-chain functions. 

To maximize test coverage for the two megacells and to 
minimize the required test development resources, the vec- 
tors used for production testing by Intel and NCR are used 
on LASI. Doing this requires multiplexing all megacell sig- 
nals to pads to create what looks to the chip tester like an 
Intel 82C596 or an NCR 53C710, depending on the test 
mode. 

This technique provides important verification and test cover- 
age, but complicated the design. Each output pad includes a 
three-input multiplexer, and each input pad drives signals to 
three destinations on-chip, significantly increasing the load- 
ing. The additional routing complexity required devoting 
more space for routing channels, and the larger pads reduced 
placement flexibility. 

Conclusions 

Integrating multiple I/O functionality onto a single VLSI chip 
can significantly reduce the cost of the I/O subsystem. How- 
ever, many system-dependent factors and each candidate 
functionality need to be examined carefully in the system 
context before deciding to integrate. Some important system 
considerations are software compatibility, the cost of dis- 
crete alternatives, the cost of printed circuit board area, 
customer connect rate, available IC fabrication capacity, 
available engineering development resources, and so on. The 
LASI chip definition is the result of a detailed investigation 
into optimizing an I/O system for HP's low-end workstations. 

An Integrated Graphics Accelerator 
for a Low-Cost Multimedia 
Workstation 

Designing with a system focus and extracting as much performance and 
functionality as possible from available technology results in a highly 
integrated graphics chip that consumes very little board area and power 
and is 50% faster and five times less expensive than its predecessor. 

by Paul Martin 



The graphics subsystem of the Model 712 workstation is a 
high-performance, low-cost solution that sits directly on the 
system bus of the Model 712 and consists of the graphics 
chip, a video RAM-based frame buffer, and a few support 
chips (see Fig. 1). The project goals closely reflect those of 
the overall HP 9000 Model 712 program. In priority order 
these goals were: 
Very low manufacturing cost 
Leadership graphics performance at entry cost levels 
Architectural compatibility 
Compelling new functionality. 

Achieving these goals required a major step in the evolution 
of HP entry-level graphics workstation hardware. Two philos- 
ophies helped the team responsible for the graphics chip 
achieve these goals. The first guiding philosophy was to 
design with a system-level focus. We examined all required 
functionality to decide whether it was best to implement it 
in the graphics subsystem, the host processor, or some com- 
bination of the two. 

The second philosophy was to extract as much performance 
and functionality as possible from readily available technol- 
ogy. We avoided leading-edge technology because of the cost 
implications. We did make an attempt to use all the features 
and performance available in mature technologies such as 
video RAMs (VRAMs) and HP's CMOS26B IC process. 

This article describes the features and functionality of the 
HP 9000 Model 712 graphics subsystem. The considerations 
that went into accomplishing the goals mentioned above are 
also described. 

Architectural Compatibility 

The CRX window accelerator card† introduced by HP in 
1991 marked the beginning of a standardized graphics hard- 
ware architecture for window system acceleration.1 This 
architecture was chosen for its simplicity of implementation 
and for the clean model it presents to the software driver 
developers. One of our fundamental design decisions was to 
accelerate key primitives only, a RISC approach. Many ear- 
lier controllers chose to accelerate a large gamut of graphi- 
cal operations such as ellipses, arithmetic pixel operations, 
and so on. Graphics subsystems designed with these con- 
trollers were typically expensive and exhibited only moder- 
ate window system performance. In the CRX and subsequent 
accelerators, including the Model 712's graphics chip, we 
decided to accelerate a carefully chosen smaller set of prim- 
itives, which are described in the following sections. 

Block Transfer. Writing pixels from system memory to the 
frame buffer or reading from the frame buffer to system 
memory is a block transfer (see Fig. 2). Writes are used to 
transfer image data to the frame buffer. Reads are used pri- 
marily to save portions of the screen temporarily obscured 
by pop-up menus (see Fig. 2b). 



† A window accelerator is the hardware that provides the images seen on the workstation 
monitor. In particular, an accelerator is geared toward speeding up environments such as 
the X Window System. The window accelerator enables the fast movement of windows on 
the screen, scrolling of text, painting of window borders and backgrounds, and so on. 

Fig. 2. (a) Block transfer write. (b) Block transfer read. Window B 
obscures window A. The obscured area is stored in system memory 
for restoration when the area of window A is exposed. 

Block Move. A block move involves transferring pixels from 
one rectangular area in the frame buffer to another (possibly 
overlapping) area in the frame buffer (Fig. 3). This is very 
useful for moving windows on the screen and scrolling lines 
of text within a window. The block move in the graphics 
chip supports Boolean operations on the data being moved, 
such as highlighting text by complementing colors. 
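
Moving a rectangle onto an overlapping destination requires care with copy order, since a naive copy can overwrite source pixels before they are read. The sketch below shows one common software approach, choosing the scan direction from the relative positions of source and destination. It is an illustration only: the function signature, the list-of-rows frame buffer, and the `op` callback are assumptions, and the article does not say how the graphics chip orders its accesses.

```python
def block_move(fb, src_x, src_y, dst_x, dst_y, width, height,
               op=lambda s, d: s):
    """Copy a rectangle within fb (a list of row lists), safe for overlap.

    op(source_pixel, dest_pixel) is an optional Boolean operation applied
    to each 8-bit pixel; the default is a plain copy.
    """
    # Walk away from the destination so that no source pixel is
    # overwritten before it has been read.
    rows = range(height) if dst_y <= src_y else range(height - 1, -1, -1)
    cols = range(width) if dst_x <= src_x else range(width - 1, -1, -1)
    for r in rows:
        for c in cols:
            s = fb[src_y + r][src_x + c]
            d = fb[dst_y + r][dst_x + c]
            fb[dst_y + r][dst_x + c] = op(s, d) & 0xFF
```

Passing `op=lambda s, d: ~s` complements each pixel as it is moved, which corresponds to the text-highlighting example mentioned above.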

Vectors. The ability to draw vectors (line segments) very 
quickly is a requirement of design applications such as sche- 
matic capture and mechanical design (Fig. 4). Thus, the 
graphics chip has a high-performance vector generator that 
creates X Window System-compliant line segments. 

Fast Text. Characters are accelerated by the graphics chip 
because of their pervasive use in window systems and the 
large potential for performance improvement over software- 
only solutions. A character is defined as a rectangular array 
of pixels that contains only two colors called foreground 
and background colors. Because there are only two choices, 
a single bit is sufficient to specify the color of each pixel in a 
character. This improves performance by reducing the 
amount of data that is transmitted from the processor to the 
graphics chip. For example, the hp character in Fig. 5 requires 
only 8 bytes of data versus 48 bytes if this optimization had 
not been made. 
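
The saving from one-bit-per-pixel characters can be made concrete with a short sketch that expands a packed glyph into 8-bit pixels. The 6-by-8 glyph size, the one-byte-per-row packing, and the function name are assumptions; they were chosen because they reproduce the 8-byte and 48-byte figures quoted above, but the article does not state the dimensions of the character in Fig. 5.

```python
def expand_glyph(bitmap_rows, width, fg, bg):
    """Expand 1-bit-per-pixel rows (one byte each, MSB = leftmost pixel)."""
    if not 1 <= width <= 8:
        raise ValueError("width must fit in one byte per row")
    return [[fg if (row >> (7 - col)) & 1 else bg for col in range(width)]
            for row in bitmap_rows]

# A hypothetical 6-wide, 8-tall glyph: 8 bytes cross the bus ...
glyph = [0b10000000, 0b10000000, 0b11110000, 0b10001000,
         0b10001000, 0b10001000, 0b00000000, 0b00000000]
pixels = expand_glyph(glyph, width=6, fg=255, bg=0)
# ... and the expansion produces 6 * 8 = 48 one-byte pixels.
```

In this model the expansion happens on the graphics chip side, so only the packed rows plus the two color values need to travel from the processor.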

Rectangular Area Fill. This primitive is widely used by win- 
dow systems to generate window borders, menu buttons, 
and so on (Fig. 6). It is also important for applications such 
as printed circuit board layout and IC physical design. Rect- 
angular areas can be patterned using two colors or contain 
only a single color. Hardware acceleration again gives a 
large speedup over software-only solutions. 

Cursor. Until the late 1980s, when hardware cursors started 
appearing in video ICs, screen cursors were typically gener- 
ated using software routines. Hardware support is a good 
trade-off because the circuitry is relatively simple, and a 
system without hardware acceleration can spend a signifi- 
cant portion of its time updating the cursor. A 64-by-64-pixel, 
two-color cursor is supported directly in the graphics chip. 

Fig. 4. Vector primitive. A vector is drawn by turning on successive 
pixels using the Bresenham algorithm. 

More complex functionality such as wide lines, circles and 
ellipses, and 3D primitives are not accelerated directly by 
the graphics chip because the application performance 
improvement was determined to be too low for the cost of 
implementation. These functions can be efficiently imple- 
mented in software. This is an example of the system-level 
design trade-offs mentioned above. 

An important aspect of this standardized architecture is 
software leverage. It is estimated that several software 
engineering years were saved on the graphics chip because 
the architecture is virtually identical to that of the CRX 
graphics subsystem. The savings in software engineering 
time was applied to tuning and adding new functionality 
instead of rewriting drivers. 

Graphics Chip Operation 

To get a better understanding of the operation of the graphics 
chip let's follow a graphics primitive through the block dia- 
gram shown in Fig. 7. A vector is a good example because it 
involves all of the blocks in the chip. Assume we have a vec- 
tor that starts at x,y coordinates 0,0, is 8 pixels long, and has 
a slope of 1/2. 

First, several parameters are calculated to set up the vector 
in the graphics chip. This is done by graphics software (e.g., 
the X Window System) running on the PA 7100LC CPU. The 
high-level specification of a vector is: 
Starting x,y coordinate 
Ending x,y coordinate. 

This data is transferred across the GSC bus, through the 
GSC interface, and into a set of registers in the macro func- 
tion unit. If these registers are already in use by the macro 
function unit, the data is placed in a 32-word-deep FIFO 
buffer that the unit can access when it becomes free. This 
increases efficiency by allowing overlap between the soft- 
ware and hardware processes. The macro function unit's 
basic job is to break down the high-level descriptions of 
graphics primitives such as vectors, text, and rectangles into 
a series of individual requests to draw pixels. 

Fig. 3. Block move. Rectangular area A is moved to a new, possibly 
overlapping location. 

Fig. 5. Fast text primitive. A character is a rectangular array con- 
taining two colors, foreground and background colors. Only a single 
bit is needed to specify each color. 

Drawing the vector is automatically triggered when the last 
of the parameters described in the specification is written 
into the macro function unit. The macro function unit then steps 
its way along the vector using the Bresenham algorithm 2 
and issues requests to draw pixels. Since the slope of our 
vector is 1/2, the y-coordinate is incremented after every 
two steps along the x-axis as indicated in Fig. 8. 
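
The stepping described above can be reproduced with a textbook integer Bresenham loop. The sketch below is restricted to slopes between 0 and 1 for brevity, and its tie-breaking rule (step in y only when the error term is strictly positive) is an assumption; the exact rule the chip uses to meet X Window System requirements is not given here. The function name and parameters are also illustrative.

```python
def bresenham_first_octant(x0, y0, dx, dy, count):
    """Return `count` pixels of a line with slope dy/dx, where 0 <= dy <= dx."""
    if dx <= 0 or not 0 <= dy <= dx:
        raise ValueError("only slopes between 0 and 1 are handled")
    err = 2 * dy - dx  # integer decision variable
    x, y = x0, y0
    pixels = []
    for _ in range(count):
        pixels.append((x, y))
        if err > 0:    # the line has drifted more than half a pixel upward
            y += 1
            err -= 2 * dx
        err += 2 * dy
        x += 1         # always step along the major (x) axis
    return pixels

# The example vector: start at 0,0, 8 pixels long, slope 1/2.
vector = bresenham_first_octant(0, 0, dx=2, dy=1, count=8)
# vector == [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3)]
```

With these inputs y advances once for every two steps in x, and the loop uses only additions, subtractions, and a sign test, which is why the algorithm suits hardware.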

One might expect that a separate x- and y-address would be 
specified for each pixel to be written. However, with vectors 
there is excellent coherence between successive x- and y- 
addresses as pixels are drawn sequentially along the vector. 
Thus, there are special bus cycles between the macro func- 
tion unit and the data formatter that specify that the pre- 
vious x- or y-coordinate should be incremented or decrem- 
ented to generate the new coordinate. This saves sending a 
full x,y coordinate pair for each pixel drawn and significantly 
improves bandwidth use on the bus. This optimization is 
also useful for other primitives such as text and rectangles. 

Fig. 6. Rectangular area fill primitive. A rectangle is defined by 
corner, width, and height. Color or pattern may be applied. 

Fig. 7. Components inside the graphics chip. 

Fig. 8. Pixel representation of a vector that starts at coordinate 0,0, 
is 8 pixels long, and has a slope of 1/2. 
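
The idea of sending increments instead of full coordinates between the macro function unit and the data formatter can be sketched as a simple encode and decode pair. The representation below (one absolute starting point followed by per-pixel steps of -1, 0, or +1 in each axis) and the function names are assumptions for illustration; the actual bus cycle encoding is not described in the article.

```python
def encode_steps(pixels):
    """Turn a pixel list into one absolute start plus per-pixel step codes."""
    if not pixels:
        return None, []
    steps = []
    for (px, py), (x, y) in zip(pixels, pixels[1:]):
        dx, dy = x - px, y - py
        if dx not in (-1, 0, 1) or dy not in (-1, 0, 1):
            raise ValueError("pixels are not adjacent")
        steps.append((dx, dy))
    return pixels[0], steps

def decode_steps(start, steps):
    """Rebuild the absolute coordinates on the receiving side."""
    x, y = start
    out = [(x, y)]
    for dx, dy in steps:
        x, y = x + dx, y + dy
        out.append((x, y))
    return out
```

Each step needs only a few bits, compared with a full x,y pair per pixel, which is the bandwidth saving the text refers to.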

The data formatter's job is to take requests and data from 
the macro function unit and format them in a way that is 
best for the frame buffer. In the case of our vector, the pixel 
addresses received by the data formatter are coalesced into 
rectangular tiles that are optimized for the frame buffer. The 
data formatter also recognizes when special VRAM modes 
may be enabled to improve performance, based on the se- 
quence of data it receives from the macro function unit. For 
example, page mode (which is described in more detail later 
in this article) would be enabled during a vector draw. The 
data formatter also stores the current pixel address, vector 
color, and a host of other parameters for other primitives. 

The frame buffer controller generates signals for the VRAMs
based on the requests from the data formatter. The controller
looks at the sequence of writes and reads requested and
adjusts the timing on the VRAM signals to maximize
performance. For our vector, we only need to do simple writes
into the frame buffer, and cycles can be as fast as 37.5 ns per
pixel. More complex primitives might require data to be
read, modified, and written back, possibly to a different
frame buffer location.

The graphics chip supports an 8-bit-per-pixel frame buffer.
This means that, using normal techniques, only 256 colors
can be displayed simultaneously. This is not always adequate
for today's graphics-oriented systems. Two methods
can be employed to increase the perceived number of colors.
The first is dithering, in which an interleaved pattern of two
available colors is used to visually approximate a requested
color that is not directly available. The second approach is
color recovery. Color recovery is visually superior to
dithering and is described later.

The Model 712's entry-level configuration frame buffer uses
four 2M-bit VRAM parts, which allows screen resolutions of
up to 1024 by 768 pixels. Adding four more VRAM chips on a
daughter card enables screen resolutions up to 1280 by 1024
pixels.
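
The arithmetic behind these two configurations can be checked with a short sketch. It assumes that a 2M-bit part holds 2 x 2^20 bits and that every visible pixel occupies exactly one byte; the function names are illustrative only.

```python
VRAM_BITS = 2 * 1024 * 1024  # one 2M-bit VRAM part (assumed to be 2 * 2**20 bits)


def frame_buffer_bytes(num_parts):
    """Total frame buffer capacity in bytes for a given number of VRAM parts."""
    return num_parts * VRAM_BITS // 8


def offscreen_bytes(num_parts, width, height, bits_per_pixel=8):
    """Bytes left over after the visible screen image is stored.

    A negative result means the resolution does not fit.
    """
    visible = width * height * bits_per_pixel // 8
    return frame_buffer_bytes(num_parts) - visible


entry_level = offscreen_bytes(4, 1024, 768)    # four parts, 1024 by 768
upgraded = offscreen_bytes(8, 1280, 1024)      # eight parts, 1280 by 1024
too_small = offscreen_bytes(4, 1280, 1024)     # negative: needs the daughter card
```

Under these assumptions four parts give 1M byte, of which a 1024-by-768 screen uses 768K bytes, and 1280 by 1024 pixels fits only once the second set of four parts is added.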

In addition to the screen image data, data for the cursor,
color lookup table, and attributes are stored in offscreen
frame buffer memory. This is an area in the video RAM
frame buffer that is never directly displayed on the CRT.
Data in this region is accessed in exactly the same fashion
as the screen image data, presenting a consistent interface
to software driver writers.

At this point our vector exists in the frame buffer, but
cannot be seen by the user. The video block is responsible for
getting the screen image data from the frame buffer and
converting it for display on the monitor. This display process
is asynchronous to the rendering process, which placed our
vector in the frame buffer.

To get the data in the frame buffer to the monitor, the video
controller first sends a request to the frame buffer controller
to access the frame buffer data. This data is requested in
sequential or scan-line order to match the path of the beam
on the monitor. Next, the data from the frame buffer is run
through a color lookup table to translate the 8-bit values into
8 bits each of red, green, and blue. The graphics chip
supports two independent color lookup tables, which are selected
on a per-displayed-pixel basis by the attribute data. This
feature helps eliminate color contention between applications
sharing the frame buffer. Finally, cursor data is merged in by
the video block and the digital video stream is converted to
analog signals for the monitor.
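
The per-pixel table selection can be sketched as below. The article does not describe the attribute encoding, so treating the attribute as a single bit that picks one of the two tables is an assumption, as are the function names and the two example ramps.

```python
def lookup_pixel(index, attribute, cluts):
    """Translate an 8-bit pixel value into (red, green, blue).

    `attribute` is assumed to be a one-bit selector choosing between the
    two color lookup tables in `cluts`.
    """
    return cluts[attribute & 1][index]


def scan_out(indices, attributes, cluts):
    """Convert one scan line of frame buffer data into RGB triples."""
    return [lookup_pixel(i, a, cluts) for i, a in zip(indices, attributes)]


# Two illustrative 256-entry tables: a gray ramp and a red ramp.
gray_ramp = [(v, v, v) for v in range(256)]
red_ramp = [(v, 0, 0) for v in range(256)]

line = scan_out([0, 200, 200], [0, 0, 1], (gray_ramp, red_ramp))
```

The same stored value (200 here) yields a different displayed color depending on which table the pixel's attribute selects, which is what lets two applications keep separate color maps on one screen.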

This completes an overview of the life of a vector primitive, 
from a high-level description in the software driver to dis- 
play on the monitor. This basic data flow is the same for 
other primitives such as rectangles and text. 

Low Manufacturing Cost 

Low cost was the primary objective for the graphics chip
design. As a measure of our success, the manufacturing cost
for the Model 712 graphics subsystem is one third the cost of
the original CRX graphics subsystem. In addition, the entry-level
1024-by-768-pixel version of the graphics chip costs one fifth
as much as the CRX subsystem.

These cost reductions were achieved primarily through an
aggressive amount of integration, which is summarized in
Fig. 9. The graphics chip represents the culmination of a
series of optimizations of the CRX family, combining almost
the entire GUI (graphical user interface) accelerator onto a
single chip. The only major function not currently integrated
is the frame buffer. Frame buffer integration is not feasible
today because RAM and logic densities are not quite high
enough and there is currently a cost advantage to using
commodity VRAM parts.

Since the introduction of the CRX subsystem, industry trends
such as denser and cheaper memory and inexpensive IC
gates have contributed to cost reductions in graphics
hardware. However, the graphics chip's high level of integration
also contributes to cost reductions in the following areas:

• Elimination of value-priced parts. The color lookup table
and the digital-to-analog converter (DAC) have traditionally
been an expensive component of the graphics subsystem.
This is especially true for systems capable of high
resolution (1280 by 1024 pixels, 135 MHz) and having multiple
color lookup tables, such as the one built into the graphics
chip. The digital phase-locked loop in the graphics chip
replaces another expensive external part.

• The density of FETs achieved with the graphics chip, over
4500/mm², is significantly higher than with previous
generations. This is important because silicon area is a major
contributor to overall design cost.

• IC packaging and testing contribute significantly to the cost
of each chip in a system. Reducing the number of chips
eliminates this overhead. The graphics chip has a full internal
scan path and many internal signature registers to reduce test
time and chip cost significantly.

• Printed circuit board area is a significant system cost. The
elimination of a large number of chips not only reduced the
printed circuit board area from about 60 in² for the CRX to
14 in² for the graphics subsystem in the Model 712, but also
allowed the graphics to be integrated directly onto the
motherboard, eliminating connectors, a bulkhead, and other
mechanical components.

• Power consumption for the graphics subsystem in the
Model 712 is only six watts. This low power consumption
reduces power supply capacity and cooling requirements
and therefore cost.

• Manufacturing costs associated with parts placement, test,
and rework are proportional to the number of discrete
components in a system. The graphics chip and other chips
in the Model 712 include JTAG (IEEE 1149.1) capability and
signature generators to reduce the cost of printed circuit
board test.

Several factors made this high level of integration practical.
The first was improved VLSI capabilities, such as increased FET
density, decreasing wafer costs, and the availability within
HP of video DAC technology. The second was the desktop
availability of design and simulation tools capable of handling a
model of over 300,000 gates and 500,000 transistors. VLSI
design and verification were accomplished on HP 9000
Series 700 workstations using Verilog, Synopsys, and many
in-house IC development tools. The performance of the
workstations allowed the gate-level simulation of entire
video frames (1/60 s of operation) of over 1.2 million pixels,
which was the first time this was accomplished within HP.

Performance 

The integration described above has also resulted in signifi- 
cant performance benefits. The two major reasons for the 
performance benefits are wider buses and increased clock 
rates. 

Wider buses are possible between blocks when they are on
the same piece of silicon. Wider buses allow better
communication bandwidth at a given clock rate, with very little cost
impact. A good example on the graphics chip is the much
improved communication between the macro function unit
and the data formatter, which once existed as separate chips.

Fig. 9. The evolution of HP's graphical user interface (GUI)
accelerator.



Increased clock rates are possible because of the elimination
of chip-to-chip synchronization delays, pad delays, and
printed circuit board trace delays. This compounds the
bandwidth benefit of wider buses. HP's CMOS26B technology
allows the bus interface, macro function unit, and frame
buffer controller blocks of the graphics chip to operate at 80
MHz while the three DACs and two color lookup tables of
the video block operate at 135 MHz.

Intelligent system-level design also made major contributions
to performance. A simple example is the block transfer
commands, which are responsible for transferring data from
system memory to the graphics chip and its frame buffer. A
special mode was introduced to the memory and I/O
controller in the PA 7100LC which allows fast sequential
double-word transfers without incurring the overhead of two
single-word transfers. This simple change boosted block transfer
performance by 60%.

Besides designing with a system-level focus, the other
driving philosophy was to extract as much performance and
utility as possible from available technology. A good example
of this is the use of the advanced features available in the
latest 2M-bit and 4M-bit VRAMs. HP has been instrumental
in proposing and driving many of these enhancements
within the JEDEC committee over the last few years. The
more important features include:

• Page mode. This feature eliminates the need to send
redundant portions of the pixel address when writing into the
frame buffer. The result is that many operations can write a
pixel in as little as 37.5 ns versus the more typical 70 ns (see
Fig. 10). The key here is that these operations must occur
within a page of VRAM or a significant penalty is incurred.
By default this page is long and narrow, which is good for
block move and block transfer operations but bad for
randomly oriented vectors and rectangles. To achieve a better
performance balance, we made use of the next feature.

• Stop register/split transfer. This feature allows the frame
buffer to be organized in pages that are more square than
long and narrow. Moving to this organization improves
random vector and small rectangle performance significantly
while only slightly reducing large horizontal primitive
performance (see Fig. 11).

Fig. 10. An illustration of the performance improvement possible
using the page mode to write pixels into the frame buffer. This
example compares the performance of each mode when just four
pixels are transferred to the frame buffer. Page mode: 11 cycles
(137.5 ns) for the first pixel, 20 cycles total (250 ns at 80 MHz).
Nonpage mode: 11 cycles (137.5 ns) for each pixel, 44 cycles total
(550 ns at 80 MHz).

• Block write. As mentioned earlier, operations such as text
and rectangular fill frequently require only one or two
colors to be selected on a per-pixel basis. For this reason
VRAMs provide a mode (via a single bit) in which a pixel's
color can be selected from an 8-bit foreground or
background color stored in the VRAMs. This translates into an 8x
performance improvement for these types of operations.
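
The cycle counts quoted for page mode (Fig. 10) and for the two page shapes (Fig. 11) can be reproduced with a simple cost model. This is a sketch under stated assumptions: a write that opens a new page costs 11 cycles, a write within the open page costs 3 cycles (37.5 ns at 80 MHz), and pages tile the frame buffer on a regular grid. The function name and the particular vertical vector are illustrative and are not taken from the figures.

```python
CYCLE_NS = 1e9 / 80e6     # 12.5 ns per cycle at 80 MHz
PAGE_MISS_CYCLES = 11     # first write into a newly opened page
PAGE_HIT_CYCLES = 3       # later write within the same page (37.5 ns)


def write_cycles(pixels, page_w, page_h, page_mode=True):
    """Count cycles needed to write `pixels` given a page shape.

    Assumes pages tile the frame buffer on a page_w-by-page_h grid.
    """
    cycles = 0
    open_page = None
    for x, y in pixels:
        page = (x // page_w, y // page_h)
        if page_mode and page == open_page:
            cycles += PAGE_HIT_CYCLES
        else:
            cycles += PAGE_MISS_CYCLES
            open_page = page
    return cycles


# Fig. 10: four pixels that all fall within one page.
four = [(x, 0) for x in range(4)]
paged = write_cycles(four, 1024, 2)                      # 11 + 3 * 3
unpaged = write_cycles(four, 1024, 2, page_mode=False)   # 4 * 11

# Fig. 11: a 10-pixel vector; a vertical one is used here for illustration.
vector = [(0, y) for y in range(3, 13)]
square_pages = write_cycles(vector, 256, 8)    # with the stop register
narrow_pages = write_cycles(vector, 1024, 2)   # without the stop register
```

With this model the four-pixel case takes 20 cycles in page mode against 44 without it, and the illustrative 10-pixel vector takes 46 cycles with 256-by-8 pages against 78 cycles with 1024-by-2 pages, matching the totals given in the figures.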

The graphics chip's performance is summarized in Table I.
The table compares the performance of the graphics chip at
its theoretical hardware limit to its performance in 80-MHz
and 60-MHz Model 712 workstations and the Model 720
CRX. The final row in Table I, Xmark, is an industry-standard
metric that is an average of several hundred X Window
System tests.

Note that the graphics chip's hardware limit is significantly
higher than the Model 712 system performance limits. This
headroom means that future systems with higher levels of
CPU performance or even more highly tuned software drivers
will be capable of even better window system performance.



Table I
Summary of the Graphics Chip's Performance

                                   Hardware  Model    Model    CRX
Benchmark                          Limit     712/80   712/60   720

Block transfer, 8-bit pixels/s     96 M      60 M     52 M     42 M
(frame buffer to system memory)

Block transfer, 8-bit pixels/s     20 M      9 M      [illegible]  [illegible]
(system memory to frame buffer)

Block move, pixels/s               47 M      40 M     31 M     40 M
(frame buffer to frame buffer,
500 by 500 pixels)

Vectors/s (10-pixel,               2.1 M     1.4 M    1.1 M    1.1 M
X compliant)

Text characters/s                  1.0 M     681 k    385 k    295 k
(6 by 13 pixels/character)

Rectangles/s                       1.7 M     790 k    588 k    270 k
(10 by 10 pixels/rectangle)

Xmark                              -         7.9      6.0      5.6



Compelling Functionality 

Beyond improving performance and dropping cost
substantially, it was an important goal to include useful new
functionality in the graphics chip. Below are some of the more
important additions.

Software Video Support. One of the design goals for the Model
712 was to be able to play MPEG and H.261 video sequences
without expensive hardware acceleration. Through careful
analysis of the decoding process it became clear that this was
possible at full frame rates and high visual quality using a
combination of the following algorithmic, PA 7100LC, and
graphics enhancements:

• Rewriting the standard decode algorithms to make them as
efficient as possible

• Adding key instructions to the PA 7100LC

• Implementing YUV-to-RGB color space conversion in the
graphics chip.



Fig. 11. Improving performance with frame buffer pages that are
more square than long and narrow. With the stop register, a page
is 256 pixels wide by 8 pixels high, and the total cycles to draw a
10-pixel vector are 11 + (4 x 3) + 11 + (4 x 3) = 46. Without the stop
register, a page is 1024 pixels wide by 2 pixels high, and the total
cycles to draw a 10-pixel vector are
11 + (11 + 3) + (11 + 3) + (11 + 3) + (11 + 3) + 11 = 78.

Fig. 12. The HP Color Recovery pipeline.

YUV encoding is used in many video formats. It allocates
proportionately more bits to encode the brightness or
luminance (Y) of the image, and fewer bits to represent the color
(UV) in the image. Since the human eye is more sensitive to
brightness than color, this is an efficient scheme. However,
since the graphics chip's frame buffer is stored in RGB
format, a conversion from YUV to RGB is necessary.

This conversion is a good example of an operation that was
relatively expensive in software (a 3-by-3 16-bit matrix
multiply) but simple to do in the graphics chip hardware.
This simple addition alone improves video playback
performance by as much as 30% and helps enable full 30-frame/s
320-by-288-pixel resolution MPEG playback on a Model
712/80.
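
To make the cost of the software path concrete, the conversion can be written as a 3-by-3 matrix multiply per pixel. The article does not list the coefficients used by the graphics chip; the sketch below assumes the widely used ITU-R BT.601 full-range values and uses floating point rather than 16-bit fixed point, so it illustrates the operation rather than the hardware.

```python
# Assumed coefficients (ITU-R BT.601, full range); not taken from the article.
YUV_TO_RGB = (
    (1.0, 0.0, 1.402),            # red   row
    (1.0, -0.344136, -0.714136),  # green row
    (1.0, 1.772, 0.0),            # blue  row
)


def clamp8(value):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def yuv_to_rgb(y, u, v):
    """Convert one 8-bit YUV pixel to 8-bit RGB with a 3-by-3 matrix multiply."""
    vec = (y, u - 128, v - 128)   # chroma is stored with an offset of 128
    return tuple(
        clamp8(sum(coef * comp for coef, comp in zip(row, vec)))
        for row in YUV_TO_RGB
    )


mid_gray = yuv_to_rgb(128, 128, 128)
```

Each pixel needs nine multiplies and six adds in this form, which is why moving the conversion into hardware on the way to the frame buffer removes a meaningful share of the per-frame software work.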

HP Color Recovery. The graphics chip incorporates a new
display technology called HP Color Recovery. Using a
low-cost 8-bit frame buffer and HP Color Recovery, the graphics
chip can display images that are in many cases visually
indistinguishable from those of a 24-bit frame buffer costing three
times more. This feature is useful for the following
application areas:

• Visual multimedia (JPEG, MPEG, etc.)

• Shaded mechanical CAD models

• Geographical imaging systems

• Document image management

• Visualization

• High-quality business graphics.

A block diagram of the HP Color Recovery pipeline is shown 
in Fig. 12. 

The HP Color Recovery encoding scheme causes no loss of 
performance for rendering operations and is related to tradi- 
tional ordered dithering. Dithering is widely used to approxi- 
mate a large number of colors with an 8-bit frame buffer and 
is also available in the graphics chip. 

The HP Color Recovery decode is much more sophisticated
and based on advanced signal processing techniques. This
circuitry cycles at 135 MHz and achieves over 9 billion
operations per second. HP Color Recovery is described in
more detail in the article on page 51.

Multiple Color Lookup Tables. Typically, entry-level
workstation and personal computer graphics subsystems have
had only a single color lookup table with a limited number
of entries, usually 256. In the X Window System this results
in the annoying flashing of backgrounds or window contents
when a new application is started that takes colors from
existing applications. The graphics chip solves this problem
in a majority of cases by providing two 256-entry color
maps. For most interactions in which the user is focused on
a single application and the window manager, this
completely eliminates the resource contention and results in a
visually stable screen (see Fig. 13).

Software Programmable Resolutions. One of the problems of
past workstation graphics subsystems is that they operate at
a fixed video resolution and refresh rate. This has posed
problems in configuring systems at the factory and during
customer upgrades. The graphics chip incorporates an
advanced digital frequency synthesizer that generates the
clocks necessary for the video subsystem. This synthesizer,
based on HP proprietary digital phase-locked loop
technology, allows software configurability of the resolution and
frequency of the video signal. Thus, alternate monitors can be
connected without changing any video hardware. Currently
supported configurations include:

• 640 by 480 pixels 60 Hz, standard VESA timing

• 800 by 600 pixels 60 Hz 

• 1024 by 1024 pixels 75 Hz and flat panel 

• 1280 by 1024 pixels 72 Hz. 

As new monitor timings appear, the graphics chip can sim- 
ply be reprogrammed with the parameters associated with 
the new monitor. 

Summary 

We created the graphics chip with the philosophies of
system-level-optimized design and optimal use of technology. This
enabled us to meet our goals of very low manufacturing
cost, leadership performance at our cost point, architectural
compatibility, and introduction of some important new
functionality.

Fig. 13. Comparison between single and multiple color lookup
tables. (a) One color lookup table. (b) Two color lookup
tables.
HP Color Recovery Technology



HP Color Recovery is a technique that brings true color capability to 
interactive, entry-level graphics devices having only eight color planes 



by Anthony C. Barkans 



For many years the only practical way to display high-quality
true color images was on a computer with a graphics
subsystem providing at least 24 color planes (see the definition
of true color on page 52). However, because of the high cost
of color graphics devices with 24 planes, many users chose
8-plane systems. Unfortunately, using these 8-plane systems
required giving up some color capabilities to save cost.



HP has developed a technique called HP Color Recovery
which provides a method for displaying millions of colors
within the cost constraints of an 8-plane system. For an
example of the image quality provided by HP Color Recovery
consider Fig. 1. Fig. 1a shows a close-up of a jet plane stored
as a full 24-bit-per-pixel true color image. Fig. 1b shows the
same jet plane displayed using a traditional 8-bit-per-pixel
system. Finally, Fig. 1c shows how the jet plane will be
displayed when using HP Color Recovery in an 8-bit-per-pixel
mode on the HCRX-8 graphics device.

Fig. 1. A true color image and its dithered representations.
(a) True color 24-bit image. (b) Typical eight-bit graphics
dithered image. (c) An HP Color Recovery dithered image.

True Color

In this paper the term true color is used to define color reproduction such that the
underlying digital quantization of the color within an image is not discernible by
the human eye. In other words, a continuous spectrum of color, such as in a
rainbow, can be displayed so that the color appears to vary continuously across the
image. In most computer graphics systems this is accomplished using 24 bits of
color information per pixel. With 24 bits, any single pixel can be displayed at one
of 2²⁴ (16.7 million) colors.

Some graphics systems may define true color to be represented by less than 24
bits per pixel.

Of course, pretty pictures aren't enough. Therefore, one of
the primary design goals for HP Color Recovery was to
supply the additional color capabilities without giving up
interactive performance. Another goal was to be able to work with
all types of applications running in a windowed environment
such as the X Window System and HP VUE. The
implementation of HP Color Recovery used in current HP
workstations meets these goals.

Traditional Eight-Plane Systems 

Traditional eight-plane systems can display only 256 colors.
Two approaches have been employed to get the best results
with limited colors. The first is called either pseudo color or
indexed color. This method selects a set of 256 colors and
then limits the application to using only that fixed set of
colors. For many applications, such as word processing and
business graphics, this approach works reasonably well
because the resultant images are made up of very few colors.
However, when an application needs more than 256 colors,
such as realistically shaded MCAD (mechanical computer-aided
design) images or human faces in video sequences, a
technique to simulate more colors is needed. For these
applications a technique called dithering is employed. The idea of
dither is to approximate a single color by displaying two other
colors at intermixed pixel locations. For example, a grid of
black and white pixels can be displayed to simulate gray. Such
a grid of black and white pixels will indeed look gray when
viewed from a distance. The primary problem with dithering is
that since most people tend to work close to the display,
dithered images are viewed as having a grainy or textured
appearance (see Fig. 1b).

Color Theory and Dither 

Before discussing the details of how HP Color Recovery 
works, an overview of color theory as it relates to computer 
generated images and dither should be helpful. This over- 
view describes how the human eye is tricked into seeing 
color, color precision in graphics, and a dithering method. 



Tricking the Human Eye 

It is often noted that computer monitors use red, green, and
blue (RGB) to produce true color images. A reasonable
question to ask is: "Why use these particular colors?" If one
examines the spectrum of visible light, it can be seen that
red is at the end of the spectrum with the longest
wavelengths that the human eye can see while blue is at the other
end. Note that green is in about the middle. Also note that
white is a mix of all colors. Therefore by mixing varying
amounts of red, green, and blue any color can be created.
For example, forcing both the red and the green CRT beams
to be on at any single location will result in a dot that
appears yellow to the human eye.

Thus, one can create the visual appearance of any color by
mixing the red, green, and blue components at any pixel
location. However, it is interesting to note that the human
eye can also perceive a new color when the component
colors are mixed spatially. For example, a checkerboard of red
and green pixels will be perceived as yellow when viewed
from a distance. It is this spatial mixing of color to form a
new color that is exploited by dither.

Color Precision 

In most systems that deal with true color, color is specified
to eight bits for each of the three color components: red,
green, and blue. The choice of eight bits is based on two
factors. First, the human eye cannot distinguish an infinite
number of shades because the dynamic range of the eye is
limited. For the most part shaded surfaces rendered with
eight bits per color appear smooth with the underlying
quantization not readily apparent to the viewer. The second
factor that works in favor of using eight bits per color
component as a standard is that eight-bit bytes are very convenient
to work with in a computer system.

Simple Dithering 

When using a 24-bit color system, any displayable color
component can be specified using eight bits. For example,
consider the red component. When there is no red in a pixel
the red component is specified with a binary value of
00000000, which is a decimal 0. A full bright red is specified
as a binary number 11111111, which is decimal 255. Of
course, high-end display systems, such as the HCRX-48Z,
use 24 bits to store and display true color information. The
visual quality of these high-end displays is shown in Fig. 1a.
However, since low-cost systems typically have a total of
only eight bits per pixel to store the color information, an
approximation to the true color image is made. The most
common method is dither using three bits each for the red
and green components. This leaves two bits for blue. Using
fewer bits for blue is based on the fact that the human eye
has less sensitivity to blue. With fewer bits available per
color component, the quantization of the colors becomes
apparent to the viewer. The effects of using a limited
number of bits for each color can be seen in Fig. 1b.
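
The 3-3-2 split can be illustrated with a small packing routine. This sketch shows plain truncation only (no dither), and the bit order within the byte, with red in the high bits, is an assumption made for the example; the article does not state how the fields are arranged.

```python
def pack_332(r, g, b):
    """Truncate 8-bit red, green, and blue to 3, 3, and 2 bits in one byte.

    Assumed layout for this sketch: RRRGGGBB (red in the high bits).
    """
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)


def unpack_332(pixel):
    """Split a packed byte back into its 3-bit, 3-bit, and 2-bit fields."""
    return pixel >> 5, (pixel >> 2) & 0b111, pixel & 0b11


packed = pack_332(0b01011000, 0, 0)
fields = unpack_332(packed)
```

With truncation alone, every red value from 01000000 through 01011111 collapses to the same stored level, which is the visible quantization that dithering is meant to hide.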

Dithering approximates any color by using a combination of
colors at adjacent pixels. When viewed from a distance the
image appears to be the correct color. However, since
dithered systems can store only a limited number of bits in the
frame buffer, the primary task of the dithering logic is to
select the best set of values to use.

For dithering purposes it is convenient to think of each
eight-bit binary component of color as a number in a three
point five (3.5) representation. This representation means
there are three bits on the left side of the binary point and
five bits on the right side of the binary point. For example,
assume the true color value for red is given as the binary
number 01011000. In a 3.5 representation the number
becomes 010.11000 binary, which is 2.75 decimal. Since the
final dithered values can only be three-bit integers, it can be
seen that using only numbers two and three would be
desirable. Ideally, the dither would set 3/4 of the pixels to three
and 1/4 to two.

If we consider the original color component as being an
eight-bit value in a 3.5 format, then the dither values stored
in the dither table should be evenly spaced between
000.00000 and 000.11111 (decimal 0.0 to almost decimal 1.0).
The output of the table is added to the original eight-bit
color component. Once the addition is complete the value is
truncated to the desired number of bits for storage in the
frame buffer. As a simple example assume that we are
continuing to work with a red component that is originally
specified as the binary number 01011000. In addition assume
that we are using a 2 x 2 dither to reduce the original 8-bit
color component to three bits. (The notation 2 x 2 dither
means that the dither pattern will repeat in a 2 x 2 grid
across the image.) To use a 2 x 2 dither, the least-significant
bits of the X and Y window addresses of the pixel are used
to index the dither table. The following example shows how
a 2 x 2 dither is applied to one pixel of the true color value
for red. Table I represents the values in a 2 x 2 dither table.

Table I
A 2 x 2 Dither Table

       Indexes              Dither Value
LSB of Y    LSB of X     Binary    Decimal

   0           0         .10100     .625
   0           1         .01100     .375
   1           0         .00100     .125
   1           1         .11100     .875

At the upper left of the window the X and Y addresses are 
both 0. To dither the data for this pixel location using our 
color value for red we do the following: 



                               Binary       Decimal

Input color                    010.11000     2.750
Dither value (from Table I)   +   .10100    + .625
Result                         011.01100     3.375
Truncated three-bit value      011           3




Therefore at address 0,0 we would store a 011 binary in the
frame buffer for red. Applying the above dither would result
in three of the four pixels within every four-pixel block being
stored in the frame buffer with a value of 011 (see Fig. 2).
The fourth pixel in each block, the one with the LSB of Y set
to a 1 and the LSB of X set to a 0, will have a 010 stored in
the frame buffer. When a region of this color of red is
viewed from a distance the color would appear to be the
correct value of 010.11000. If the dithered jet plane shown in
Fig. 1b is examined, it can be seen that it is dithered using a
method similar to the one described above.

Fig. 2. Results after applying dither. Each box represents a pixel
location on the display screen. For example, address (0,0) is defined
as the upper left corner of the display. Also note that the numbers
stored at each pixel location represent the results of applying the
dither values given in Table I to a red component of color originally
specified as 01011000 binary (2.75 in our 3.5 notation).
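
The worked example can be reproduced with integer arithmetic by treating the 3.5 value as an ordinary 8-bit integer and the dither values from Table I as 5-bit fractions. The function name is illustrative, and the clamp to the largest three-bit value is an assumption added so that very bright inputs do not overflow; the article does not say how the hardware handles that case.

```python
# Dither values from Table I, indexed by (LSB of Y, LSB of X),
# written as 5-bit fractions of the 3.5 fixed-point format.
DITHER_2X2 = {
    (0, 0): 0b10100,  # .625
    (0, 1): 0b01100,  # .375
    (1, 0): 0b00100,  # .125
    (1, 1): 0b11100,  # .875
}


def dither_component(value8, x, y):
    """Reduce an 8-bit color component to three bits with a 2 x 2 dither."""
    total = value8 + DITHER_2X2[(y & 1, x & 1)]  # add the table entry
    level = total >> 5                           # truncate the 5 fraction bits
    return min(level, 0b111)                     # assumed clamp against overflow


red = 0b01011000  # 010.11000 in 3.5 notation, or 2.75 decimal
block = [[dither_component(red, x, y) for x in (0, 1)] for y in (0, 1)]
average = sum(sum(row) for row in block) / 4
```

For the red value 01011000 this stores three pixels at level 3 and one at level 2 in each 2 x 2 block, so the block averages 2.75, the value the eye is expected to integrate.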

From a distance the colors in the dithered image are
integrated by the eye so that they appear correct. However, the
fundamental problem with dither is that most dithered
images are viewed up close and so the dithering pattern is
noticeable in the image.

Dithering Is Key 

It is important to realize that to approximate any true color
value, a spatial region of the screen is required. This often
leads people to say that dithering is a method that trades off
spatial resolution for color resolution. However, this is
misleading. Some people believe that a single-pixel object
cannot be dithered. Actually a single-pixel object can be
dithered. The result is that the object will be one of the two
dither colors. Going back to the example above, a
single-pixel red object specified as binary 01011000 (decimal 2.75)
will be stored at any single pixel location as either binary 010
or 011 (decimal two or three). Taken by itself, any single
pixel is not a perfect approximation of the true color.
However, it is still a reasonable approximation.

The idea of being able to encode each pixel in the image
independently by using dither is key to enabling color
recovery to work in an interactive environment. As a historical
note it should be mentioned that over the last few years
several people have developed methods to bring true color
capabilities to eight-bit graphics devices. However, these
attempts have been based on complex multipixel encoding
schemes. For the most part they have applied data
compression techniques to the data stored in the frame buffer. These
methods have produced high-quality images, but the
encoding is so complex that the user must give up interactive
performance to use them. Because of the performance
problems these methods have not been widely adopted by the
computer graphics community.

HP Color Recovery 

The simplest explanation of HP Color Recovery is that it
performs the task your eye is asked to do with an ordinary
dithered system. In essence, an HP Color Recovery system
takes 24-bit true color data generated by an application and
dithers it down to eight bits for storage in the frame buffer.
Then as the frame buffer data is scanned from the frame
buffer to the display, it passes through specialized digital
signal processing (DSP) hardware where the work of
producing millions of colors is performed. The output of the
DSP hardware is sent to the display where millions of colors
can be viewed. It is important to recognize that since the
data stored in the HP Color Recovery frame buffer is
dithered, thousands of applications can work with it. It is also
important to recognize that these applications will run at full
performance in an interactive windowed environment. In
other words, applications do not need to be changed to take
advantage of HP Color Recovery.

The Process 

HP Color Recovery is a two-part process. First, true color 
information generated by the application is dithered and then 
stored in the frame buffer. The type of application generating 
the true color information is immaterial. For example, true 
color data can be generated by a CAD application program 
or as part of a video sequence. The dithering may be done in 
a software device driver or in the hardware of a graphics 
controller. It is very important to note that each pixel is
treated independently. This pixel independence is key to the
ability to work within an interactive windowed environment.
The second part of the HP Color Recovery process is to filter
the dithered data. The filter is placed between the output
of the frame buffer and the DACs that drive the monitor. Fig.
3 shows the HP Color Recovery process starting from when
an application generates true color data to when the image
appears on the screen. Note that "application" refers to any 
program that generates true color data for display. 



After the application generates the data, it is sent to the
device driver. The function of the driver is to isolate the
application from hardware dependencies. The driver is supplied by
HP. It causes hardware dithering to be used when possible.
However, there are times when the driver must perform the
dither in software. It is important to note that compared to
other dithered systems, there is no performance penalty
suffered by an application using HP Color Recovery dither.

The frame buffer stores the image data. Note that in most
current systems the output of the dithered frame buffer is
sent to the display, resulting in the common patterned
appearance in the image. However, with HP Color Recovery, as
the frame buffer data is scanned, it is sent through a
specialized digital signal processing (DSP) circuit. The DSP is a
sophisticated circuit that removes the patterning from the
dithered image stored in the frame buffer. This circuit
performs over nine billion operations per second. Despite this
enormous amount of processing the circuit is surprisingly
small. It is this small size that makes HP Color Recovery
inexpensive enough to be considered for inclusion in low-end
graphics systems.

The Dither Process 

In HP Color Recovery the quality of the displayed image
depends on the dither used to encode the image. During the
development of HP Color Recovery it was found that the
size of the dither region determines how well a color can be
recovered. It was found that from a region of 2^N pixels the
technique can recover about N bits of color per component.
Therefore, an eight-bit frame buffer that stores data in 3-3-2
format (3 bits each for red and green and 2 bits for blue)
would need a dither region of 32 pixels for each color
component to recover 5 additional bits. Thus, using a 32-pixel
dither region, an area in the image of uniform color can have
the same visual quality as an 8-8-7 image. For example, the
sky behind the jet plane in Fig. 1c was recovered to within 1
bit of the original 24-bit true color data shown in the top
image.
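
The relationship between region size and recovered precision can be checked with a little arithmetic. The following Python sketch is illustrative only: the function name is hypothetical, and it simply applies the rule stated above (a region of 2^N pixels recovers about N bits per component) to a 3-3-2 frame buffer.

```python
import math

def recovered_format(stored_bits, region_pixels):
    """Return the approximate bits per component after recovery.

    stored_bits   -- bits stored per component, e.g. (3, 3, 2)
    region_pixels -- pixels in the dither region (assumed a power of two)
    """
    extra = int(math.log2(region_pixels))  # 2**N pixels -> about N bits
    return tuple(bits + extra for bits in stored_bits)

print(recovered_format((3, 3, 2), 32))  # (8, 8, 7)
print(recovered_format((3, 3, 2), 16))  # (7, 7, 6)
```

Under this rule a 16-pixel region would fall one bit per component short of the 32-pixel region.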

Most dithers use a 4 x 4 dither region. Since a 4 x 4 region
covers only 16 pixels, a larger dither region is needed for HP
Color Recovery. Therefore, a dither table with 32 entries
organized as 2 x 16 was selected. (The reason for this odd
shape is discussed later in this paper.) In addition, most
dithers are as simple as the one described earlier in this paper.
However, there are cases in which a simple dither does not
work well. Note that in using the simple dither method
described above, all true color values from binary 11100000 to
11111111 would dither to 111. For HP Color Recovery the
dither table includes both positive and negative numbers.
This improves the color range over which the dither is
useful.
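
The saturation problem can be seen in a small sketch of a simple dither. The Python code below is an illustration under stated assumptions, not the HP Color Recovery dither: the 2 x 2 table of positive offsets (0, 8, 16, 24) and the function names are hypothetical, chosen so that the 2.75 example dithers to three pixels of 3 and one pixel of 2.

```python
# Hypothetical 2 x 2 table of positive offsets, in units of 1/32 of a
# stored step (the low five bits of an eight-bit component).
SIMPLE_TABLE = [[0, 8], [16, 24]]

def simple_dither(value8, x, y):
    """Dither an eight-bit component to three bits at pixel (x, y)."""
    offset = SIMPLE_TABLE[y % 2][x % 2]
    return min((value8 + offset) >> 5, 7)  # truncate to 3 bits, clamp at 111

def dither_region(value8):
    """Dither one constant value over a full 2 x 2 region."""
    return [simple_dither(value8, x, y) for y in range(2) for x in range(2)]

print(dither_region(0b01011000))  # [2, 3, 3, 3]

# Every value from 11100000 to 11111111 lands on 111 at every pixel.
saturated = all(set(dither_region(v)) == {7}
                for v in range(0b11100000, 0b100000000))
print(saturated)  # True
```

With only positive offsets, nothing in the top range of input values can be told apart after dithering, which is the case the signed dither table is meant to improve.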

The HP Color Recovery dither is a little different from most
dithers. However, it is on the same order of complexity. It
should also be noted that the HP Color Recovery dither is
included in the hardware of all HP graphics workstations that
support this technique. This means that using the HP Color
Recovery dither does not cause a decrease in performance.

The Filter Process 

In the example given earlier a red color component
represented by the binary value 01011000 (2.75 in decimal) was
used to illustrate simple dithering. For this example we used
a 2 x 2 dither region in which the end result of the dither
was that 3/4 of the pixels stored in the frame buffer were set
to 3 (011) and 1/4 of the pixels were set to 2 (010). It is easy
to see that if we average the four pixels in the 2 x 2 region
we will recover the original color. This can be done as
follows:

([value_1 x number_set_to_value_1] + [value_2 x
number_set_to_value_2])/total_number_pixels

Using the example data we obtain: ([3 x 3] + [2 x 1])/4 = 2.75.

This averaging works very well in regions of constant color,
such as the sky behind the jet plane in Fig. 1. However, there
is one fundamental issue that must be addressed for HP
Color Recovery to be viable and that is how to handle edges
in the image. If edges are not accounted for then the
resultant image will blur. The two-dimensional representations of
an area of a display screen shown in Fig. 4 are used to
illustrate the problem of edge detection and the way the
problem is addressed in HP Color Recovery.

As in Fig. 2, each box represents a pixel location on the
display screen. In Fig. 4a the numbers represent the original
true color data for one of the color components (e.g., red) in
a 24-bit per pixel system. Fig. 4b shows the same region
after simple dithering has been applied. Fig. 4c shows the
pixel values after the application of HP Color Recovery. The
Fig. 4c pixel values represent the color data that would be
displayed on the computer screen.

Region A in each of these figures is an area of constant color, 
whereas region B encompasses an edge. For illustration pur- 
poses, the dither region is again assumed to be 2 x 2 pixels. 

The dithered color data shown in Fig. 4b is derived from the
original color data shown in Fig. 4a and from using the
simple dithering technique described in connection with Table 1.
The data shown in Fig. 4b is what would be stored in the
frame buffer and displayed in a typical dithered system (e.g.,
Fig. 1b).

When it is time to display Pix_1 the data for the four pixels
shown as Region A in Fig. 4b would be sent to the filter. The
data stored in the region would be summed and then divided
by the number of pixels in the region. The sum of the pixels
in Region A is 11 and 11/4 = 2.75. Thus, the output of the
filter when evaluating Pix_1 would be 2.75. This output value
would be displayed on the computer display at Pix_1's
location. Note that the output of the filter is the exact value of
the original data at that point in Fig. 4a.

Fig. 4. (a) Pixel values for the original 24-bit per pixel color data.
(b) The color data from Fig. 4a after it has been dithered and placed
in the frame buffer. This would be the data displayed in a typical
dithered system with the result appearing as in Fig. 1b. (c) The
pixels from Fig. 4b after applying HP Color Recovery.

The next pixel along the scan line to be evaluated is Pix_2.
The filter region for evaluating Pix_2 would include the two
rightmost pixels of region A and the two leftmost pixels of
region B (see Fig. 4b). Applying the filter operation for Pix_2
again results in the output value matching the value at that
location in Fig. 4a (2.75).

If the evaluation is done on Pix_3, the pixels in region B
would be summed and then divided by the number of pixels
in the region, and the result would be 4.50. This value is very
different from the original data value of 2.75 in Fig. 4a. Using
the value of 4.50 at Pix_3 would result in edge smearing. To
solve this problem a special edge detector that looks for
edges in noisy data is used. The idea is to compare each
pixel in the filter region with a value that is within ±1 of the
pixel being evaluated. Since the data stored at Pix_3 is a 3,
only pixels within region B that have a value of 2, 3, or 4
would pass the edge compare. The values that pass the edge
detector are then summed and the total is divided by the
number of pixels that pass the edge compare. For Pix_3, only
Pix_3 and the pixel below it would pass the edge detector.
Summing the two passing values together and dividing by 2
gives a result of 2.50. This value is slightly different from the
original value of 2.75, but it is a better estimate of the
original than the 4.50 obtained without the edge detection. The
displayed values for the entire example region are shown in
Fig. 4c.
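
The edge-aware averaging just described can be sketched in a few lines of Python. This is a minimal illustration, not the hardware implementation: the function name is hypothetical, and the two values on the far side of the edge in region B (6 and 7) are assumptions chosen only so that the plain average comes out to the 4.50 quoted above.

```python
def filter_pixel(region, evaluated, tolerance=1):
    """Average the pixels in `region` whose value is within `tolerance`
    of the pixel being evaluated (which is itself part of the region)."""
    passing = [p for p in region if abs(p - evaluated) <= tolerance]
    return sum(passing) / len(passing)

region_a = [3, 3, 3, 2]  # constant color 2.75 after simple dithering
region_b = [3, 2, 6, 7]  # Pix_3, the pixel below it, and two assumed
                         # values from the other side of the edge

print(filter_pixel(region_a, 3))      # 2.75
print(sum(region_b) / len(region_b))  # 4.5, the smeared plain average
print(filter_pixel(region_b, 3))      # 2.5
```

In region A every pixel passes the compare, so the filter reduces to the plain average; in region B only the two pixels on the near side of the edge contribute.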

Software Considerations 

Since dithered frame buffers are in common use today, 
many existing software applications can work with a dith- 
ered frame buffer. All of these applications could work with 
HP Color Recovery. 

On products that use the Model 712's graphics chip, which
is described in the article on page 43, and HP's Hyperdrive
(HCRX), HP Color Recovery is supported. In these products
we have chosen to have HP Color Recovery enabled as the
default for 3D applications run in an eight-bit visual
environment. Thus when using the 3D graphics libraries Starbase,
PHIGS, or PEXlib, and opening an application in an eight-bit
visual environment with true color mode, HP Color Recovery
will normally be enabled. Of course, setting an application
to use a pseudo color map will disable HP Color Recovery
and give the application the desired pseudo color
capability. Because Xlib is tied into the pseudo color model
rather than the 3D libraries, Xlib applications leave HP
Color Recovery off by default. However, a mechanism is
supported that allows HP Color Recovery to be enabled
when using Xlib.1 The biggest change is that Xlib applications
must do their own dithering.

Implementation 

The implementation of HP Color Recovery was based on the
assumption that color recovery would be most useful in
entry-level graphics products. Entry-level graphics products
are defined as products in which there is storage for only 8
bits per pixel in the frame buffer. These same products that
benefit the most from HP Color Recovery are also the ones
where product cost must be carefully controlled. Therefore,
the implementation effort was driven with a strong sense of
cost versus end user benefit.

Dither Table Shape. As mentioned earlier, the dither region
shape used with HP Color Recovery is 2 x 16. The optimum
shape would be closer to square, such as 4 x 8. However, the
filter circuit needs storage for the pixels within the region. A
2 x 16 circuit requires that the current scan line's pixel and
the data for the scan line above be available. This means
that as data for any scan line enters the circuit, it is used to
evaluate pixels on the current scan line. In addition, the data
is saved in a scan line buffer so it can be used when
evaluating the pixels on the next scan line (see Fig. 5). It should be
noted that the storage for a scan line of data uses
approximately one half of the circuit area in the current
implementation. Therefore, if a 4 x 8 region had been used, three scan
line buffers would have been required, almost doubling the
cost of the HP Color Recovery logic.

Fig. 6. A simplified representation of the logic circuitry that exists
in each of the logic blocks in Fig. 5.
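
The cost argument for the 2 x 16 shape can be made concrete with a rough estimate. The Python sketch below is a back-of-the-envelope model only: the function names are hypothetical, and it assumes that one scan line buffer is exactly half of the 2 x 16 circuit and that the remaining logic is the same size for either region shape.

```python
def line_buffers(region_rows):
    """Scan lines that must be stored besides the one currently arriving."""
    return region_rows - 1

def relative_cost(region_rows, buffer_share=0.5):
    """Circuit area relative to the 2 x 16 design (which is 1.0)."""
    logic = 1.0 - buffer_share  # everything that is not a line buffer
    return logic + buffer_share * line_buffers(region_rows)

print(line_buffers(2), relative_cost(2))  # 1 1.0
print(line_buffers(4), relative_cost(4))  # 3 2.0
```

Under these assumptions a four-row region needs three buffers and about twice the area of the two-row design.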

Filter Function Logic. As explained earlier, the HP Color
Recovery filter function averages the data within a region by
summing the data for the pixels that pass an edge compare
operation. The sum is then divided by the number of pixels
that pass the edge compare. Typically, building the logic for
a filter function like this is difficult and costly because it
requires a divide circuit running at the video clock rate of
135 MHz. The HP Color Recovery filter function is
implemented so that this is not a problem.

The implementation details of the filter function are complex. 
However, if we ignore the high-speed pipeline issues and 
some minor adjustments required to optimize image quality, 
we can reduce the implementation of the filter function to
the following equation: 

$$\sum_{i=0}^{k-1} \left[ (\text{Frame\_Buffer\_Data}_i)(W_i) + (\text{Evaluated\_Pixel})(\overline{W_i}) \right] \qquad (1)$$

where k is the number of pixels in the filter and $W_i$ is a flag
equal to one when a pixel passes an edge compare operation
and zero when it doesn't (its complement, $\overline{W_i}$, is
one only when the pixel fails). This flag can be thought of as the
output of the comparator shown in Fig. 6.

The idea behind this equation is that if a pixel passes the
edge compare, include it in the total. On the other hand, if a
pixel fails the edge compare, then substitute the data for the
pixel being evaluated for the failing pixel. The overriding
assumption is that the pixel being evaluated is a reasonably
good guess of the true color data. The worst case is that all
the pixels around the sample fail the edge compare and the
dithered color is used for that location. Since dithering uses
a reasonable sample at each location this extreme case
results in a reasonable image being displayed.

To see how this works let's look at two examples. In the first
example assume that the pixel being evaluated is a single
red dot specified using 01011000 binary (2.75 in our decimal
numbering system). This is the same color used in some of
the examples described earlier. However, this time let us
assume that it is dithered to a value of 011. Also assume that
this pixel is surrounded by green. Since the edge compare is
done on a per-color basis, all the pixels in the region except
the pixel being evaluated will fail the edge compare. In this
case we will add a red value of 011 thirty-two times. The
result out of the adder tree in Fig. 5 will have a red value of
01100000 (3.00 in decimal). Although this is not exact it will
appear as a red dot in the middle of a green region. In other
words, a reasonable approximation.

In the second example assume a region that is filled with red
is specified with the same eight-bit binary value of 01011000.
Also assume the simple dither method described earlier is
used. In this case 3/4 of the pixels will be stored as 011. The
other 1/4 will be stored as 010. Since none of the pixels fails
the edge compare we will send twenty-four pixels with the
value of 011 and eight with the value 010 to the adder tree.
The result of the adder will be a binary value of 01011000
(2.75 decimal). In this case the output of HP Color Recovery
will match the input true color data exactly.
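
Both examples can be traced with a short Python sketch of equation 1 for one color component. It is illustrative rather than a description of the hardware: the function name and the red value of 0 assumed for the green neighbors are hypothetical, and the ±1 tolerance is the edge compare described in the filter discussion. Because a failing pixel is replaced rather than dropped, exactly 32 three-bit terms are always added, so the sum is already an eight-bit value, which suggests why no divider is needed.

```python
def recover(region, evaluated, tolerance=1):
    """Evaluate equation 1 for one color component.

    region    -- the k dithered values in the filter region
    evaluated -- the dithered value of the pixel being evaluated
    """
    total = 0
    for pixel in region:
        w = 1 if abs(pixel - evaluated) <= tolerance else 0  # edge compare
        total += pixel * w + evaluated * (1 - w)  # substitute on failure
    return total

# First example: one red pixel (dithered to 011) surrounded by green.
# The neighbors' red value of 0 is an assumption for illustration.
isolated = [3] + [0] * 31
print(format(recover(isolated, 3), "08b"))  # 01100000

# Second example: a uniform red region, 24 pixels of 011 and 8 of 010.
uniform = [3] * 24 + [2] * 8
print(format(recover(uniform, 3), "08b"))   # 01011000
```

Read with three integer bits and five fraction bits, the two results are 3.00 and 2.75.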

Hardware Details. The filtering logic, which was shown in a
systems context in Fig. 3, is expanded in Fig. 5. As the frame
buffer is scanned, each pixel in the display is sequentially
sent to the logic shown in Fig. 5. The left side of the figure
shows the path taken as the data for each pixel read from
the frame buffer enters the filtering logic. The data is sent
both to a pipeline register for immediate use, and to a scan
line buffer for use when the next scan line is being
evaluated. The 32 registers shown in Fig. 5 store the data for the
2 x 16 region being evaluated. These registers are clocked at
the pixel clock rate. Note that the data for each pixel on the
display will pass through the location marked with the X.
When a pixel is at the location X, it is called the pixel being
evaluated. This means that the results of applying equation
1 are assigned to the display at the screen address of X.

The 32 pixels stored in the pipeline registers shown in Fig. 5
are sent through blocks of logic that perform the inner loop
evaluation of equation 1. This inner loop is essentially an
edge detector. The logic shown in Fig. 6 allows only pixels
that have similar numeric values to the pixel being evaluated
to be included in the summation. The summation logic is
simply an adder tree that sums the results of the pixels
passing the edge compare. The filter function is performed in
parallel for all the pixels within the filter region.

Given the complexity of the function being performed in the
filter circuit, the circuit is surprisingly small. The entire filter
circuit is made up of approximately 35,000 transistors.
Compared to the number of transistors required to increase the
number of color planes, this is very small. For example,
increasing the number of color planes from 8 to 16 on a typical
SVGA (Super VGA) system (1024 x 768-pixel resolution)
requires over 8,000,000 transistors, which is 1M bytes of
additional frame buffer memory. Because of the small size of
the HP Color Recovery circuit, it is inexpensive enough to
be included in entry-level graphics systems.

Questions and Answers 

Thus far the concepts behind HP Color Recovery have been
discussed. It has been shown that HP Color Recovery can
supply additional color capabilities to low-end graphics
systems while maintaining an interactive windowed
environment. The following are answers to the most frequently asked
questions about the practical use of HP Color Recovery.

• Question: Is there a difference between a 24-bit true color
image and one displayed using HP Color Recovery?

Answer: Yes. If you view a 24-bit image and an HP Color
Recovery image side by side there are differences. For
example, the back edge of the wing in Fig. 1c has some artifacts in
it. At normal size the artifacts can be found but are less
noticeable than in Fig. 1c.

• Question: How many colors are reproducible with HP Color
Recovery?

Answer: In the best case HP Color Recovery can provide up
to 23 bits of accuracy. However, in typical images about four
million colors can be reproduced.

• Question: Are artifacts introduced by HP Color Recovery?

Answer: In areas of very low contrast, artifacts will show
up. Again the back edge of the wing in Fig. 1c is a good
example.

• Question: Does HP Color Recovery look the same on all HP
products that support it?

Answer: No. The first implementation was designed for the
graphics chip used in the HP 9000 Model 712 workstation.
After that design was finished some improvements were
made which ended up in the HCRX family of graphics
devices. These changes are hidden deep in the details of the
implementation, enabling any application using HP Color
Recovery on one product to work without change on the
other products.

• Question: Do applications need to change to use HP Color
Recovery?

Answer: If the application was written using a 3D application
program interface the answer is no. Of course it must
be running in an eight-bit visual environment on a device
that supports HP Color Recovery. In addition, the application
must have been written to use the 24-bit true color
model. However, if the application was written using Xlib
then it must be changed to do the dithering. Details can be
found in reference 1.

• Question: Is there a way to turn HP Color Recovery off?

Answer: Yes. Set the environment variable
HP_DISABLE_COLOR_RECOVERY to any value.

• Question: What happens to the color map in the HP 9000
Model 712's graphics chip when HP Color Recovery is
enabled?

Answer: In the graphics chip there are two hardware color
maps. By default, the X11 server permanently downloads
the default color map into one of these hardware color
maps. If HP Color Recovery is enabled the remaining color
map is used by HP Color Recovery. See the article on page
43 for more information about these color maps.

• Question: What happens to the color map on HCRX
graphics when HP Color Recovery is enabled?

Answer: On HCRX graphics devices there are two hardware
color maps in the overlay planes and two in the image planes.
By default, the X11 server permanently downloads the default
color map into one of the overlay planes' hardware color
maps. This is true in each of the following configurations:
- The HCRX-8 and HCRX-8Z frame buffer configurations with
no transparency have one hardware color map in the
overlay planes and two in the image planes that are available. In
this configuration the HP Color Recovery color map can be
downloaded into any of the available hardware color maps.
- The HCRX-8 and HCRX-8Z frame buffer configurations
with transparency have only one hardware color map in
the overlay planes and only one in the image planes. Since
the hardware color map for the overlay planes already has
the default color map loaded into it, there is only one
color map available for HP Color Recovery to choose
from. Therefore, in this configuration the HP Color
Recovery color map is downloaded into the remaining hardware
color map.
- The HCRX-24 and HCRX-24Z frame buffer configurations
with or without transparency have one hardware color
map in the overlay planes and two in the image planes
that are available. In this configuration, when using an
eight-bit visual depth the HP Color Recovery color map
can be downloaded into any of the available hardware
color maps.

• Question: Does HP Color Recovery work with logical raster
operations?

Answer: Yes. Like any dithered frame buffer system, HP
Color Recovery works with raster operations such as AND,
OR, and XOR.

• Question: How do image processing applications interact
with HP Color Recovery?

Answer: There are two basic classes of image processing
applications: feature finding and image enhancement.
- Feature finding. Most feature-finding applications are
based on edge detection. The results of running one of
these types of applications can be displayed using HP
Color Recovery. However, as with other dithered frame
buffers, any application using the frame buffer as the
image source may have problems if it does not account for
the dither.
- Image enhancement. Image enhancement applications are
typically used to enhance images for the human visual
system. The goal of many of these applications is to bring out
low-level features of the image. It is possible to preprocess
the image and send it to HP Color Recovery. However, if
there is a need for an extremely high-quality image (e.g.,
medical imaging) a 24-bit frame buffer may be necessary.

• Question: If an image is dithered using a dither method
other than the one developed for HP Color Recovery, can it
be displayed on a system that supports HP Color Recovery?

Answer: Yes. One option is to turn HP Color Recovery off.
However, the image can be processed with HP Color
Recovery on. In this case the image will be viewable. The image
quality will be comparable to viewing the image on a typical
dithered system, but the dithering artifacts will be replaced
with a new set of artifacts.

• Question: Can an image created using the HP Color Recovery
dither method be viewed on an eight-bit system that does
not support HP Color Recovery?

Answer: Yes. However, it is important to realize that without
the HP Color Recovery back end the dithering artifacts will
be visible in the image.

• Question: Can a user read the frame buffer data?

Answer: Yes. However, as with any dithered system there is
the issue of precision. For example, if the red data is
generated with eight bits of precision, then the readback will give
a three-bit dithered value for the data. The data on readback
is not the same as the eight-bit value generated by the
application.

• Question: Does HP Color Recovery work with multimedia
applications?

Answer: Yes. By removing the dithering artifacts, image
quality during MPEG (video) playback is improved.

• Question: Does HP Color Recovery impact application
performance?

Answer: No. The HP Color Recovery dither is implemented
in fast hardware in both the Model 712's graphics chip and
the HCRX graphics subsystem. When hardware dithering
cannot be used, such as with virtual memory double
buffering, a software dither is performed by the device driver. Since
the dither is the same complexity as common dithers, there
is no performance penalty for using HP Color Recovery
when compared to using other dithered systems.

In addition, the DSP circuit in the back end is placed in the
path of the data being scanned into the monitor. As such the
DSP does get in the path (without affecting application
performance) when the system is performing what the user
sees as interactive tasks.

• Question: Can an image generated using HP Color Recovery
be displayed on output devices other than monitors (e.g.,
printers)?

Answer: Many applications generate a print file. In this case
the data displayed on the monitor is not used to create the
print file. Therefore, HP Color Recovery will not interfere
with the output. Another method used to generate hardcopy
is a screen dump. Unfortunately, a complete solution for
dumping a color-recovered image to a printer is not
available yet.

Conclusion 

Color recovery brings added color capabilities to entry-level 
systems. Since the technology is based on dither, these addi- 
tional color capabilities can be brought to an entry-level 
system while maintaining an interactive environment that 
supports many current applications.

Real-Time Software MPEG Video
Decoder on Multimedia-Enhanced PA 
7100LC Processors 

With a combination of software and hardware optimizations, including the 
availability of PA-RISC multimedia instructions, a software video player 
running on a low-end workstation is able to play MPEG compressed video 
at 30 frames/s. 

by Ruby B. Lee, John P. Beck, Joel Lamb, and Kenneth E. Severson 



Traditionally, computers have improved productivity by
helping people compute faster and more accurately. Today,
computers can further improve productivity by helping
people communicate better and more naturally. Towards
this end, at Hewlett-Packard we have looked for more
natural ways to integrate communication power into our desktop
machines, which would allow a user to access distributed
information more easily and communicate with other users
more readily.

We felt that adding audio, images, and video information
would enrich the information media of text and graphics
normally available on desktop computers such as
workstations and personal computers. However, for such
enriched multimedia communications to be useful, it must be
fully integrated into the user's normal working environment.
Hence, as the technology matured we decided to integrate
increasing levels of multimedia support into both the user
interface and the basic hardware platform.

In terms of user interface, we integrated a panel of
multimedia icons into the HP VUE standard graphical user
interface, which comes with all HP workstations. These
multimedia icons are part of the HP MPower product.1 HP
MPower enables a workstation user to receive and send
faxes, share printers, access and manipulate images, hear and
send voice and CD-quality stereo audio, send and receive
multimedia email, share an X window or an electronic
whiteboard with other distributed users, and capture and play back
video sequences. The HP MPower software is based on a
client/server model, in which one server can service around
20 clients, which can be workstations or X terminals.

In terms of hardware platforms, we integrated successive
levels of multimedia support into the baseline PA-RISC
workstations.2,3,4 First, we integrated support for all the popular
image formats such as JPEG (Joint Photographic Experts
Group)† compressed images.5 Then, we added hardware
and software support for audio, starting with 8-kHz
voice-quality audio, followed by support for numerous audio
formats including A-law, μ-law, and 16-bit linear mode, with up
to 48-kHz mono and stereo. This allowed high-fidelity,
44.1-kHz stereo, 16-bit CD-quality audio to be recorded,
manipulated, and played back on HP workstations. At the
same time, we supported uncompressed video capture and
playback.

† JPEG is an international digital image compression standard for continuous-tone
(multilevel) still images (grayscale and color).

In January 1994, HP introduced HP MPower 2.0 and the
entry-level enterprise workstation, the HP 9000 Model 712,
which is based on the multimedia-enhanced PA-RISC
processor known as the PA 7100LC.6,7,8 The video player
integrated in the MPower 2.0 product is the first product that
achieves real-time MPEG-1 (Moving Picture Experts
Group)9 video decompression via software running on a
general-purpose processor. Typically real-time MPEG-1
decompression is achieved via special-purpose chips or
boards. Previous attempts at software MPEG-1
decompression did not attain real-time rates.10 The fact that this is
achieved by the low-end Model 712 workstation is significant.

In this paper, we discuss the support of MPEG-compressed
video as a new (video) data type. In particular, we discuss
the technology that enables the video player integrated into
the HP MPower 2.0 product to play back MPEG-compressed
video at real-time rates of up to 30 frames per second.

Digital Video Standards 

We decided to focus on the MPEG digital video format
because it is an ISO (International Standards Organization)
standard, and it gives the highest video fidelity at a given
compression ratio of any of the formats that we evaluated.
MPEG also has broad support from the consumer
electronics, telecommunications, cable, and computer industries.
The high compression capability of MPEG translates into
lower storage costs and less bandwidth needed for
transmitting video on the network. These characteristics make
MPEG an ideal format for addressing the need for detail in
the video used in technical workstation markets and
computer-based training in commercial workstation markets.

MPEG is one of several algorithmically related standards
shown in Fig. 1. All of these digital video compression
standards use the discrete cosine transform (DCT) as a
fundamental component of the algorithm. Alternatives to discrete
cosine-based algorithms that we looked at include vector
quantization, fractals, and wavelets. Vector quantization
algorithms are popular on older computer architectures
because they require less computing power to decompress, but
this advantage is offset by poorer image quality at low
bandwidth (high compression) compared to MPEG for practical
vector quantization methods. Algorithms based on wavelet
and fractal technology have the potential to deliver video
fidelity comparable to MPEG, but there is presently a lack of
industry consensus on standardization, a key requirement for
our use.

Fig. 1. Digital video standards based on the discrete cosine transform.

Another advantage of a high-performance implementation of
MPEG is the ability to leverage the improvements to the
other DCT-based algorithms. Although the relationships
shown in Fig. 1 do not represent a true hierarchy of
algorithms, the figure is useful for illustrating increased
complexity as one moves from JPEG to MPEG-2, or from H.261
to MPEG-2.

All of these formats have much in common, such as the use
of the DCT for encoding. The visual fidelity of the algorithms
was the key selection criterion and not ease of implementation
or performance on existing hardware.

Although JPEG supports both lossy and lossless compression,
the term JPEG is typically associated with the lossy
specification.† The primary goal of JPEG is to achieve high
compression of photographic images with little perceived
loss of image fidelity. Although it is not an ISO standard, by
convention, a sequence of JPEG lossy images used to create a
digital video sequence is called motion JPEG, or MJPEG.

† In lossless compression, decompressed data is identical to the original image data. In lossy
compression, decompressed data is a good approximation of the original image data.

H.261 is a digital video standard from the telecommunications
standards body ITU-TSS (formerly known as CCITT). H.261 is one
of a suite of conferencing standards that make up the umbrella
H.320 specification. H.261 is often referred to as P×64 (where
P is an integer) because it was designed to fit into multiples
of 64 kbits/s bandwidth. The first frame (image) of an H.261
sequence is for all practical purposes a highly compressed
lossy JPEG image. Subsequent frames are built from image
fragments (blocks) that are either JPEG-like or are
differences from the image fragments in previous frames. Most
video sequences have high frame-to-frame coherence. This is
especially true for video conferencing. Because the encoding
of the movement of a piece of an image requires less data than
an equivalent JPEG fragment, H.261 achieves higher visual
fidelity for a given bandwidth than does motion JPEG. Since
the encoding of the differences is always based on the
previous frames, the technique is called forward differencing.

The MPEG-1 specification goes even further than H.261 in
allowing sophisticated techniques to achieve high fidelity
with fewer bits. In addition to forward differencing, MPEG-1
allows backward differencing (which relies on information
in a future frame) and averaging of image fragments. (Forward
and backward differencing are described in more detail in the
next section.) MPEG-1 achieves quality comparable to a
professionally reproduced VHS videotape even at a single-speed
CD-ROM data rate (1.5 Mbits/s). MPEG-1 also specifies
encodings for high-fidelity audio synchronized with the video.

MPEG-2 contains additional specifications and is a superset
of MPEG-1. The new features in MPEG-2 are targeted at
broadcast television requirements, such as support for
frame interleaving similar to analog broadcast techniques.
With widespread deployment of MPEG-2, the digital revolution
for video may be comparable to the digital audio revolution
of the last decade.

The approximate bandwidths required to achieve a level of
subjective visual fidelity for motion JPEG, H.261, MPEG-1,
and MPEG-2 are shown in Fig. 2. Motion JPEG will primarily
be used for cases in which accurate frame editing is
important, such as video editing. H.261 will be used primarily
for video conferencing, but it also has potential for use in
video mail. MPEG-1 and MPEG-2 will be used for publishing,
where fidelity expectations have been set by consumer analog
video tapes, computer-based training, games, movies on
CD, and video on demand.

MPEG Compression 

MPEG has two classes of frames: intracoded and nonintracoded
frames (see Fig. 3). Intracoded frames, also called I-frames,
are compressed by reducing spatial redundancy within the
frame itself. I-frames do not depend on comparisons with past
or "reference" frames. They use JPEG-type compression for
still images.

Fig. 3. MPEG frame sequencing.



Nonintracoded frames are further divided into P-frames and
B-frames. P-frames are predicted frames based on comparisons
with an earlier reference frame (an intracoded or predicted
frame). By considering temporal redundancy in addition to
spatial redundancy, P-frames can be encoded with fewer bits.
B-frames are bidirectionally predicted frames that require one
backward reference frame and one forward reference frame for
prediction. A reference frame can be an I-frame or a P-frame,
but not a B-frame. By detecting the motion of blocks from both
a frame that occurred earlier and a frame that will be played
back later in the video sequence, B-frames can be encoded in
fewer bits than I- or P-frames.

Each frame is divided into macroblocks of 16 by 16 pixels
for the purposes of motion estimation† in MPEG compression
and motion compensation in MPEG decompression. A frame
with only I-blocks is an I-frame, whereas a P-frame has
P-blocks or I-blocks, and a B-frame has B-blocks, P-blocks, or
I-blocks. For each P-block in the current frame, the block in
the reference frame that matches it best is identified by a
motion vector. Then the differences between the pixel values
in the matching block in the reference frame and the current
block in the current frame are encoded by a discrete cosine
transform.

† Motion estimation uses temporal redundancy to estimate the movement of a block from
one frame to the next.

The color space used is the YCbCr color representation
rather than the RGB color space, where Y represents the
luminance (or brightness) component, and Cb and Cr represent
the chrominance (or color) components. Because human
perception is more sensitive to luminance than to chrominance,
the Cb and Cr components can be subsampled in both the x and y
dimensions. This means that there is one Cb value and one Cr
value for every four Y values. Hence, a 16-by-16 macroblock
contains four 8-by-8 blocks of Y, and only one 8-by-8 block of
Cb and one 8-by-8 block of Cr values (see Fig. 4). This is a
reduction from the twelve 8-by-8 blocks (four for each of the
three color components) if Cb and Cr were not subsampled. The
six 8-by-8 blocks in each 16-by-16 macroblock then undergo
transform coding.
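The block counts for a macroblock with and without chrominance subsampling can be checked with a few lines of Python. This is only an illustrative sketch: the function name and its parameters are hypothetical and are not part of the MPEG software described in this article.

```python
def blocks_per_macroblock(mb_size=16, block_size=8, chroma_subsample=2):
    """Count the 8-by-8 blocks carried by one macroblock.

    chroma_subsample is the factor by which Cb and Cr are subsampled in
    both x and y (2 for the scheme described in the text, 1 for none).
    """
    luma_blocks = (mb_size // block_size) ** 2
    chroma_side = mb_size // chroma_subsample
    chroma_blocks = 2 * (chroma_side // block_size) ** 2  # Cb plus Cr
    return luma_blocks + chroma_blocks


print(blocks_per_macroblock())                    # Cb and Cr subsampled
print(blocks_per_macroblock(chroma_subsample=1))  # no subsampling
```

With subsampling the count is six blocks per macroblock, against twelve without it, so half of the transform work is avoided before any other compression step is applied.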

Transform coding concentrates energy in the lower
frequencies. The transformed data values are then quantized by
dividing by the corresponding quantization coefficient. This
results in discarding some of the high-frequency values, or
lower-frequency but low-energy values, since these become
zeros. Both transform coding and quantization enable further
compression by run-length encoding of zero values.
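A minimal Python sketch of quantization followed by run-length encoding of zeros is shown here. The coefficient values, the quantization coefficients, truncation toward zero, and the (zero run, value) pair format are all assumptions made for illustration; the MPEG bit stream defines its own tables and codes.

```python
def quantize(coeffs, qcoeffs):
    """Divide each transformed value by its quantization coefficient.

    Truncation toward zero is an assumption made for this sketch.
    """
    return [int(c / q) for c, q in zip(coeffs, qcoeffs)]


def run_length_zeros(values):
    """Encode values as (zero_run, nonzero_value) pairs.

    Trailing zeros are simply dropped, which is a simplification.
    """
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs


# Hypothetical transformed values and quantization coefficients.
coeffs = [620, -35, 12, 3, 40, 1, 0, 0]
qcoeffs = [8, 16, 16, 16, 16, 16, 16, 16]

quantized = quantize(coeffs, qcoeffs)
print(quantized)                    # small values have become zeros
print(run_length_zeros(quantized))  # zeros are carried as run lengths
```

Small values are driven to zero by the division, and the runs of zeros that result cost only a count in the encoded output.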

Finally, the nonzero coefficients of an 8-by-8 block used in
the discrete cosine transform can be encoded via
variable-length entropy encoding such as Huffman coding.
Entropy encoding basically removes coding redundancy by
assigning the code words with the fewest number of bits to
those coefficients that occur most frequently.

Fig. 4. Subsampling of the chrominance components (Cb, Cr) with
respect to the luminance (Y) component.

MPEG Decompression

MPEG decompression reverses the functional steps taken 
for MPEG compression. There are six basic steps involved 
in MPEG decompression. 

1. The MPEG header is decoded. This gives information
such as picture rate, bit rate, and image size.

2. The video data stream is Huffman or entropy decoded
from variable-length codes into fixed-length numbers. This 
step includes run-length decoding of zeros. 

3. Inverse quantization is performed on the numbers to 
restore them to their original range. 

4. An inverse discrete cosine transform is performed on the 
8-by-8 blocks in each frame. This converts from the frequency
domain back to the original spatial domain. This gives the 
actual pixel values for I-blocks, but only the differences for 
each pixel for P-blocks and B-blocks. 

5. Motion compensation is performed for P-blocks and B- 
blocks. The differences calculated in step 4 are added to the 
pixels in the reference block as determined by the motion 
vector for P-blocks and to the average of the forward and 
backward reference blocks for B-blocks. 

6. The picture is displayed by doing a color conversion from 
YCbCr coordinates to RGB color coordinates and writing to
the frame buffer. 
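Step 5 can be sketched in Python as follows. The sketch treats a block as a flat list of pixel values. The function names, the truncating average, and the clamp to the 0-to-255 pixel range are assumptions for illustration and are not taken from the decoder described here.

```python
def clamp(value, lo=0, hi=255):
    """Keep a reconstructed pixel inside the 8-bit range (an assumption)."""
    return max(lo, min(hi, value))


def reconstruct_p(ref_block, diffs):
    """P-block: add the decoded differences to the reference block."""
    return [clamp(r + d) for r, d in zip(ref_block, diffs)]


def reconstruct_b(fwd_block, bwd_block, diffs):
    """B-block: add the differences to the average of both references."""
    return [clamp((f + b) // 2 + d)
            for f, b, d in zip(fwd_block, bwd_block, diffs)]


print(reconstruct_p([100, 250], [5, 10]))
print(reconstruct_b([100, 200], [120, 100], [-10, 3]))
```

The reference blocks passed in are assumed to have already been located by the motion vectors decoded from the stream.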

Methodology 

Our philosophy was to improve the algorithms and tune the 
software first, resorting to hardware support only if neces- 
sary. We set a goal of 10 to 15 frames/s for software MPEG
video decompression because this is the rate at which motion
appears smooth rather than jerky.

We started by measuring the performance of the MPEG software
we had purchased. This software initially took two
seconds to decode one frame (0.5 frame/s) on an older
50-MHz Model 720 workstation. This decoding was for video
only and did not include audio. Profiling indicated that the
inverse discrete cosine transform (step 4) took the largest
chunk of the execution time, followed by display (step 6),
followed by motion compensation (step 5). The decoding of
the MPEG headers was insignificant.

With this data we set out to optimize every step in the MPEG
decompression software. After we applied all the algorithm
enhancements and software tuning, we measured the MPEG
decode software again. While we had achieved an order of
magnitude improvement, the rate of 4 to 5 frames/s was not
sufficient to meet our goal.

Hence, we looked at possible multimedia enhancements to
the basic PA-RISC processor and other system-level
enhancements that would not only speed up MPEG decoding,
but also be generally useful for improving performance in
other computations. In addition, any chip enhancements we
added could not adversely impact the design schedule,
complexity, cycle time, and chip size of the PA-RISC processor
we were targeting, the PA 7100LC, which was already deep
into its implementation phase at the time. The PA 7100LC is
described in detail in the article on page 12.

We approached this problem by studying the distribution of
operations executed by the software MPEG decoder. Then,
we found ways to reduce the execution time of the most
frequent operation sequences. The application of algorithm
enhancements, software tuning, and projected hardware
enhancements was iterated until we attained our goal of
being able to decompress at a rate greater than 15 frames/s
via software.

Algorithm and Software Optimizations 

In terms of MPEG video algorithms, we improved on the 
Huffman decoder, the motion compensation, and the inverse 
discrete cosine transform. A faster Huffman decoder based 
on a hybrid of table lookup and tree-based decoding is used. 
The lookup table sizes were chosen to reduce cache misses. 
For motion compensation, we sped up the pixel averaging 
operations. 

For the inverse discrete cosine transform, we use a faster 
Fourier transform, which significantly reduces the number 
of multiplies for each two-dimensional 8-by-8 inverse
discrete cosine transform. In addition, we use the fact that the
8-by-8 inverse transform matrices are frequently sparse to 
further reduce the multiplies and other operations required. 
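The benefit of sparse input can be illustrated with a direct (not fast) 8-by-8 inverse discrete cosine transform that skips zero coefficients. This Python sketch uses the textbook definition of the transform and is not the fast algorithm used in the decoder; the function name is hypothetical.

```python
import math


def idct_8x8_sparse(coeffs):
    """Direct 8-by-8 inverse DCT that does no work for zero coefficients.

    coeffs[v][u] is the coefficient for vertical frequency v and
    horizontal frequency u. Returns an 8-by-8 list of float samples.
    """
    out = [[0.0] * 8 for _ in range(8)]
    for v in range(8):
        for u in range(8):
            c = coeffs[v][u]
            if c == 0:
                continue  # sparse input: this term contributes nothing
            cu = math.sqrt(0.5) if u == 0 else 1.0
            cv = math.sqrt(0.5) if v == 0 else 1.0
            for y in range(8):
                cos_y = math.cos((2 * y + 1) * v * math.pi / 16)
                for x in range(8):
                    cos_x = math.cos((2 * x + 1) * u * math.pi / 16)
                    out[y][x] += 0.25 * cu * cv * c * cos_x * cos_y
    return out


# A block with only a DC coefficient decodes to a constant block.
dc_only = [[0] * 8 for _ in range(8)]
dc_only[0][0] = 8
pixels = idct_8x8_sparse(dc_only)
print(pixels[0][0], pixels[7][7])
```

For a block with a single nonzero coefficient, the inner loops run once instead of 64 times, which is the kind of saving that sparse blocks make possible.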

The MPEG audio decompression is also done in software.
This algorithm was improved by using a 32-point discrete
cosine transform to speed up the subband filtering. 12

In terms of software tuning, we "flattened" the code to re- 
duce the number of procedure calls and returns, and the 
frequent building up and tearing down of contexts present in 
the original MPEG code. We also did "strength reductions"
like reducing multiplications to simpler operations such as
shift and add or table lookup.
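As an illustration of strength reduction, a multiplication by the constant 10 can be written as two shifts and one add. The constant and the function name are chosen only for this Python sketch, and the speed benefit applies to compiled code on the target processor rather than to Python itself.

```python
def times_10(x):
    """Compute 10 * x as (8 * x) + (2 * x) using shifts and one add."""
    return (x << 3) + (x << 1)


print(times_10(7), 10 * 7)
```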

The last column of Table I shows the percentage of execution
time spent in each of the six MPEG decompression steps
after the algorithm and software tuning improvements were
made. The first two columns of Table I show the millions of
instructions executed in each of the six decompression steps
and the percent of the total instructions executed (path
length) each step represents. The input video sequence was
an MPEG-compressed clip of a football game. The total time
taken was 7.45 seconds on an HP 9000 Model 735 99-MHz
PA-RISC workstation, with 256K bytes of instruction cache
and 256K bytes of data cache.



Table I
Instructions and Time Spent in Each MPEG Decompression Step
on an HP 9000 Model 735

                          Millions of     Path Length    Time
                          Instructions    (%)            (%)

Header decode                  0.6            0.1          0.1
Huffman decode                55.3           10.2          7.5
Inverse quantization           8.7            1.6          2.4
Inverse discrete
  cosine transform           206.5           38.3         38.7
Motion compensation           79.9           14.8         18.3
Display                      188.7           35.0         33.0
Total                        539.7          100.0        100.0

The largest slice of execution time (38.7%) and the largest
chunk of instructions executed (38.3%) were still the inverse
discrete cosine transform. We studied the frequencies of
generic operations in this group and attempted to execute
them faster. This resulted in new PA-RISC processor
instructions for accelerating multimedia software.

PA-RISC Processor Enhancements

The new processor multimedia instructions implemented
in the PA 7100LC processor allow simple arithmetic operations
to be executed in parallel on subword data in the standard
integer data path. In particular, the integer ALU is
partitioned so that it can execute a pair of arithmetic
operations in a single cycle with a single instruction. The
arithmetic operations accelerated in this way are add,
subtract, average, shift left and add, and shift right and
add. The latter two operations are effective in implementing
multiplication by constants.

PA-RISC Multimedia Extensions 1.0. The PA 7100LC PA-RISC
processor chip contains some instructions that operate
independently and in parallel on two 16-bit data fields within
a 32-bit register. These operations are independent in that
bits carried or shifted out of one of the fields never affect
the result in the other field. These operations occur in
parallel in that a single instruction computes both 16-bit
fields of the result. Table II summarizes these instructions.

HADD does two parallel 16-bit additions on the left and the
right halves of registers ra and rb, placing the two 16-bit
results into the left and right halves of register rt.

HSUB does two parallel 16-bit subtractions on the left and
right halves of registers ra and rb, placing the two 16-bit
results into the left and right halves of register rt.

Both HADD and HSUB perform modulo arithmetic (modulus
2^16), that is, the result wraps around from the largest number
back to the smallest number and vice versa. This is the usual
mode of operation of twos complement adders when overflow
is ignored.
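The modulo behavior of the two halfword operations can be modeled in Python. This is a behavioral sketch of the description, not processor code; the helper names are hypothetical, and a 32-bit register is represented as a Python integer with the left halfword in the upper 16 bits.

```python
MASK16 = 0xFFFF


def split(reg):
    """Return the (left, right) halfwords of a 32-bit register value."""
    return (reg >> 16) & MASK16, reg & MASK16


def join(left, right):
    """Pack two results into a register, keeping each modulo 2^16."""
    return ((left & MASK16) << 16) | (right & MASK16)


def hadd(ra, rb):
    """Model of HADD: two independent 16-bit additions modulo 2^16."""
    a1, a2 = split(ra)
    b1, b2 = split(rb)
    return join(a1 + b1, a2 + b2)


def hsub(ra, rb):
    """Model of HSUB: two independent 16-bit subtractions modulo 2^16."""
    a1, a2 = split(ra)
    b1, b2 = split(rb)
    return join(a1 - b1, a2 - b2)


# The right halfword wraps around without disturbing the left halfword.
print(hex(hadd(0x0001FFFF, 0x00010001)))
print(hex(hsub(0x00050000, 0x00010001)))
```

In both calls the right halfword overflows, and the left halfword is unaffected because nothing is carried or borrowed across the halfword boundary.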

HADD and HSUB also have two saturation arithmetic options.
With the signed saturation option, HADD.ss, both operands
and the result are considered signed 16-bit integers. If the
result cannot be represented as a signed 16-bit integer, it is
clipped to the largest positive value (2^15 - 1) if positive
overflow occurs, or it is clipped to the smallest negative
value (-2^15) if negative overflow occurs.

With the unsigned saturation option, HADD.us, the first
operand (ra) is considered an unsigned 16-bit integer, the
second operand (rb) is considered a signed 16-bit integer, and
the result (in rt) is considered an unsigned 16-bit integer.
If the result cannot be represented as an unsigned 16-bit
integer, it is clipped to the largest unsigned value
(2^16 - 1) if positive overflow occurs, or it is clipped to
the smallest unsigned value (0) if negative overflow occurs.

The signed saturation and unsigned saturation options for 
parallel halfword subtraction are defined similarly. 
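The two saturation options for halfword addition can be modeled in Python as follows. The function names are hypothetical; the operand interpretations (both signed for the signed option, unsigned plus signed for the unsigned option) follow the descriptions above.

```python
def to_signed16(h):
    """Interpret a 16-bit pattern as a twos complement integer."""
    return h - 0x10000 if h & 0x8000 else h


def sat_signed(v):
    """Clip to the signed 16-bit range."""
    return max(-2**15, min(2**15 - 1, v))


def sat_unsigned(v):
    """Clip to the unsigned 16-bit range."""
    return max(0, min(2**16 - 1, v))


def hadd_ss(ra, rb):
    """Model of HADD with signed saturation on each halfword."""
    out = 0
    for shift in (16, 0):
        a = to_signed16((ra >> shift) & 0xFFFF)
        b = to_signed16((rb >> shift) & 0xFFFF)
        out |= (sat_signed(a + b) & 0xFFFF) << shift
    return out


def hadd_us(ra, rb):
    """Model of HADD with unsigned saturation on each halfword."""
    out = 0
    for shift in (16, 0):
        a = (ra >> shift) & 0xFFFF               # unsigned operand
        b = to_signed16((rb >> shift) & 0xFFFF)  # signed operand
        out |= sat_unsigned(a + b) << shift
    return out


print(hex(hadd_ss(0x7FFF8000, 0x0001FFFF)))  # both halves clip
print(hex(hadd_us(0xFFFF0001, 0x0001FFFE)))  # left clips high, right clips to 0
```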

HAVE, or halfword average, gives the average of each pair of
halfwords in ra and rb. It takes the sum of parallel halfwords
and does a right shift of one bit before storing each 16-bit
result into rt. During the one-bit right shift, the carry is
shifted in on the left and unbiased rounding* is performed on
the least-significant bit on the right. Because the carry is
shifted in, no overflow can occur in the HAVE instruction.

HSLkADD, or halfword shift left and add, allows one operand
to be shifted left by k bits (where k is 1, 2, or 3) before being
added to the other operand.

HSRkADD, or halfword shift right and add, allows one operand
to be shifted right by k bits (where k is 1, 2, or 3) before
being added to the other operand.

Both HSLkADD and HSRkADD use signed saturation.

* Unbiased rounding means that the net difference between the true averages and the averages
obtained after unbiased rounding is zero if the results are equally distributed in the result range.

Table II
PA-RISC Multimedia Instructions in PA 7100LC

ra contains a1; a2
rb contains b1; b2
rt contains t1; t2

Instruction            Parallel Operation

HADD ra,rb,rt          t1 = (a1+b1) mod 2^16;
                       t2 = (a2+b2) mod 2^16;

HADD.ss ra,rb,rt       t1 = IF (a1+b1) > (2^15-1) THEN (2^15-1)
                            ELSEIF (a1+b1) < -2^15 THEN (-2^15)
                            ELSE (a1+b1);
                       t2 = IF (a2+b2) > (2^15-1) THEN (2^15-1)
                            ELSEIF (a2+b2) < -2^15 THEN (-2^15)
                            ELSE (a2+b2);

HADD.us ra,rb,rt       t1 = IF (a1+b1) > (2^16-1) THEN (2^16-1)
                            ELSEIF (a1+b1) < 0 THEN 0
                            ELSE (a1+b1);
                       t2 = IF (a2+b2) > (2^16-1) THEN (2^16-1)
                            ELSEIF (a2+b2) < 0 THEN 0
                            ELSE (a2+b2);

HSUB ra,rb,rt          t1 = (a1-b1) mod 2^16;
                       t2 = (a2-b2) mod 2^16;

HSUB.ss ra,rb,rt       t1 = IF (a1-b1) > (2^15-1) THEN (2^15-1)
                            ELSEIF (a1-b1) < -2^15 THEN (-2^15)
                            ELSE (a1-b1);
                       t2 = IF (a2-b2) > (2^15-1) THEN (2^15-1)
                            ELSEIF (a2-b2) < -2^15 THEN (-2^15)
                            ELSE (a2-b2);

HSUB.us ra,rb,rt       t1 = IF (a1-b1) > (2^16-1) THEN (2^16-1)
                            ELSEIF (a1-b1) < 0 THEN 0
                            ELSE (a1-b1);
                       t2 = IF (a2-b2) > (2^16-1) THEN (2^16-1)
                            ELSEIF (a2-b2) < 0 THEN 0
                            ELSE (a2-b2);

HAVE ra,rb,rt          t1 = (a1+b1)/2;
                       t2 = (a2+b2)/2;

HSLkADD ra,k,rb,rt     t1 = (a1<<k) + b1;
                       t2 = (a2<<k) + b2;
                       (for k = 1, 2, or 3)

HSRkADD ra,k,rb,rt     t1 = (a1>>k) + b1;
                       t2 = (a2>>k) + b2;
                       (for k = 1, 2, or 3)

ss = signed saturation option
us = unsigned saturation option

Saturation Arithmetic. In saturation arithmetic a result is said
to have a positive overflow if it is larger than the largest
value in the defined range of the result. It is said to have a
negative overflow if it is smaller than the smallest value in
the defined range of the result. If the saturation option is
used for the HADD and HSUB instructions, the result is clipped
to the maximum value in its defined range if positive overflow
occurs and to the minimum value in its defined range if
negative overflow occurs. This further speeds up the
processing because one instruction replaces the approximately
ten instructions otherwise needed to check for positive and
negative overflows and perform the desired clipping of the
result for a pair of operations.

Saturation arithmetic is highly desirable in dealing with
pixel values, which often represent hues or color intensities.
It is undesirable to perform the normal modulo arithmetic in
which overflows wrap around from the largest value to the
smallest value and vice versa. For example, in 8-bit pixels, if
0 represents black and 255 represents white, a result of 256
should not change a white pixel into a black one, as would
occur with modulo arithmetic. In saturation arithmetic, a
result of 256 would be clipped to 255.

Effect on MPEG Decoding. These parallel subword arithmetic
operations significantly speed up several critical parts of the
MPEG decoder program, especially in the inverse discrete
cosine transform and motion compensation steps. More
than half of the instructions executed for the inverse
transform step are these parallel subword arithmetic
instructions. Their implementation does not impact the
processor's cycle time, and adds less than 0.2% of silicon area
to the PA 7100LC processor chip. Actually, the area used was
mostly empty space around the ALU, so that these multimedia
enhancements can be said to have contributed to more efficient
area utilization, rather than adding incremental chip
area. See "Overview of the Implementation of the Multimedia
Enhancements" on page 66.

Since the PA 7100LC processor has two integer ALUs, we
essentially have a parallelism of four halfword operations
per cycle. This gives a speedup of four times, in places
where the superscalar ALUs can be used in parallel. Because
of the built-in saturation arithmetic option, speedup of
certain pieces of code is even greater.

System Optimization 

The second longest functional step (see Table I) in MPEG
decompression was the display step. Here, we leveraged the
graphics subsystem to implement the color conversion step
together with the color recovery already being done in the
graphics chip. 7 Color conversion converts between color
representations in the YCbCr color space and the RGB color
space. Color recovery reproduces 24-bit RGB color that has
been color compressed into 8 bits before being displayed.
Color compression allows the use of 8-bit frame buffers in
low-cost workstations to achieve almost the color dynamics
of 24-bit frame buffers. This leveraging of low-level pixel
manipulations close to the frame buffer between the graphics
and video streams also contributed significantly to the
attainment of real-time MPEG decompression. Color recovery
and the graphics chip are described in the articles on
pages 51 and 43, respectively.



Other PA 7100LC processor enhancements streamline the
memory-to-I/O path. By having the memory controller and
the I/O interface controller integrated in the PA 7100LC chip,
overhead in the memory-to-frame-buffer bandwidth is reduced.
Overhead in the processor-to-graphics-controller chip path is
also reduced for both control and data.

Path Length Reduction 

Table III shows the same information as Table I but for the
low-end Model 712 workstation, which uses the
multimedia-enhanced PA 7100LC processor and the graphics chip
mentioned above.



Table III
Instructions and Time Spent in Each MPEG Decompression Step
on a Model 712 Workstation

                          Millions of     Path Length    Time
                          Instructions    (%)            (%)

Header decode                  0.6            0.2          0.3
Huffman decode                55.0           16.1         14.5
Inverse quantization           8.9            2.6          4.5
Inverse discrete
  cosine transform           138.5           40.6         34.4
Motion compensation           74.8           21.9         25.6
Display                       63.0           18.5         20.7
Total                        340.8          100.0        100.0



The Model 712 executes consistently fewer instructions than
the Model 735 for the same MPEG decompression of the
same video clip. It is also faster in MPEG decompression
even though it operates at only 60% of the 99-MHz rate of the
high-end Model 735 and has only one eighth of the cache
size. This shows the performance benefits from the path
length reduction enabled by the PA-RISC processor and
system enhancements for multimedia acceleration.
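The path length reduction can be quantified from the instruction counts in Table I and Table III. The Python sketch below only repeats the arithmetic on those counts; the dictionary layout is an illustrative choice, and the Model 712 header decode entry (0.6) is inferred from the table total because the printed value is illegible.

```python
# Millions of instructions per step, from Table I (Model 735)
# and Table III (Model 712).
MODEL_735 = {
    "header decode": 0.6, "huffman decode": 55.3,
    "inverse quantization": 8.7, "inverse dct": 206.5,
    "motion compensation": 79.9, "display": 188.7,
}
MODEL_712 = {
    "header decode": 0.6,  # inferred from the 340.8 total
    "huffman decode": 55.0,
    "inverse quantization": 8.9, "inverse dct": 138.5,
    "motion compensation": 74.8, "display": 63.0,
}


def reduction_percent(before, after):
    """Percentage by which the instruction count dropped."""
    return 100.0 * (before - after) / before


total_735 = sum(MODEL_735.values())
total_712 = sum(MODEL_712.values())
print(f"total: {reduction_percent(total_735, total_712):.1f}% fewer")
for step in ("inverse dct", "display"):
    pct = reduction_percent(MODEL_735[step], MODEL_712[step])
    print(f"{step}: {pct:.1f}% fewer")
```

Most of the reduction comes from the inverse discrete cosine transform and display steps, which are the steps targeted by the halfword instructions and by the graphics subsystem, respectively.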

Performance 

The performance of the PA-RISC architectural enhancements
and the leveraging of the graphics subsystem for
video decompression can be seen in Fig. 5. This data is for a

Fig. 5. Maximum MPEG decode frame rates for different models of
HP 9000 Series 700 workstations (Model 715 at 50 MHz, Model 735 at
99 MHz, Model 712 at 60 MHz, and Model 712 at 80 MHz). These rates
are for a 352-by-240-pixel clip that was encoded at 30 frames/s.

Overview of the Implementation of the PA 7100LC Multimedia Enhancements

One goal in adding the multimedia instructions was to minimize the amount of
new circuits to be added to the existing ALUs and to minimize the impact on the
rest of the CPU. This goal was accomplished. The only circuit changes to the CPU
were in the ALU data path and decoder circuits. These instructions reuse most of
the existing functionality, and only very small modifications and additions were
required to implement them.

All of the new instructions implemented require two 16-bit adds or subtracts to be
done in parallel. The existing ALU adder was modified to provide this functionality.
These instructions required that the existing 32-bit adder be conditionally split
into two 16-bit halves without sacrificing the performance of the 32-bit add.
Conceptually this is equivalent to blocking the carry from bit 16 to bit 15 in a
ripple-carry adder. To accomplish this, we made the following modifications.

The ALU adder is similar to a carry lookahead adder. The first stage of the adder
calculates a carry generate and a carry propagate signal for each single bit in the
adder. In this case, 32 single-bit generate and 32 single-bit propagate signals are
calculated. These single-bit carry generate and carry propagate signals are used in
subsequent stages of the carry chain to calculate carry generate and carry
propagate signals for groups of bits.

The 32-bit adder was divided into two 16-bit halves between bits 15 and 16 by
providing alternate signals for the carry generate and carry propagate signals from
bit 16 (Fig. 1). The new generate and propagate signals from bit 16 are created
with a two-input multiplexer. When a 32-bit addition or subtraction is being
performed, the multiplexer selects the original generate and propagate signals to be
passed on to the next stage of the carry chain. When a 16-bit addition or subtraction
is being performed, the multiplexer selects the value for generate and propagate
from the second input, which is false (logical 0) for additions and true (logical 1) for
subtractions.

Fig. 1. Modifications to the carry lookahead adder to accommodate the halfword instructions.

Fig. 2. Saturation logic. There is one of these circuits for each halfword.

The new generate and propagate signals can be forced to be false for instructions
requiring halfword addition. This stops the carry from being generated by bit 16 or
propagating from bit 16 to bit 15, even if this generate and propagate signal is not
used directly to calculate the carry signal (as is the case in this adder). The
generate and propagate signals can also be forced to be true for instructions requiring
halfword subtraction. This will force a carry into the more significant halfword of
the adder by generating a carry from bit 16 into bit 15. This technique is used
along with the ones complement of the operand to be subtracted to perform
subtraction as twos complement addition.
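The effect of forcing the carry at the halfword boundary can be modeled with a simple bit-serial adder in Python. This is a behavioral sketch only: it uses a ripple-carry loop rather than the carry lookahead structure of the real ALU, the function name is hypothetical, and bit index 0 in the code is the least-significant bit, whereas the text numbers bits from the most-significant end.

```python
def split_add(a, b, halfword, subtract=False):
    """Add or subtract 32-bit values, optionally as two 16-bit halves.

    Subtraction is performed as a + ~b + 1. In halfword mode the carry
    into the upper halfword is replaced by a forced value: 0 for
    addition and 1 for subtraction.
    """
    if subtract:
        b = ~b & 0xFFFFFFFF           # ones complement of the operand
    forced = 1 if subtract else 0
    carry = forced                    # carry into the lowest bit
    result = 0
    for i in range(32):
        if halfword and i == 16:
            carry = forced            # block or force the boundary carry
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result


print(hex(split_add(0x0001FFFF, 0x00010001, halfword=False)))
print(hex(split_add(0x0001FFFF, 0x00010001, halfword=True)))
print(hex(split_add(0x00050000, 0x00010001, halfword=True, subtract=True)))
```

In halfword mode the overflow of the lower halfword no longer reaches the upper halfword, and in halfword subtraction the forced carry supplies the +1 that completes the twos complement of the upper halfword.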

The original carry generate and propagate signals from bit 16 are still generated to 
calculate overflows from the less significant halfword addition. This overflow is 
used by the saturation logic, which can be invoked by some of these instructions. 

Saturation requires groups of bits of the result to be forced to states of true or
false, or passed unchanged. This is accomplished with an AND-OR gate (Fig. 2).
The AND function can force the output of the gate to be false and the OR function
can force the output of the gate to be true. Thus, the output is either forced high,
forced low, or forced neither high nor low. It is never simultaneously forced high
and low. The key is to determine when to force the result to a saturated value.

The saturation circuit is added at the end of the ALU's data path after the result
selection multiplexer selects one of the results from the adder after it performs
additions, subtractions, or logical operations such as bitwise AND, OR, or XOR
(Fig. 3). The saturation circuit does not impact the critical speed paths of the ALU
because it is downstream from the point where the cache data address is driven
from the adder and where the test condition logic (i.e., logic for conditional branch
instructions) obtains the results from which to calculate a test condition.

If signed saturation is selected, the ALU will force any 16-bit result that is larger
than 0x7fff to 0x7fff (2^15 - 1) and any 16-bit result that is smaller than 0x8000 to
0x8000 (-2^15). These conditions represent positive and negative overflow of
signed numbers. Positive and negative overflow can be detected by examining the
sign bit (the MSB) of each operand and the result of the add. If both operands are
positive and the result is negative, then a positive overflow has occurred, and the
result in this case is saturated by forcing the most-significant bit to a logical 0 and
the rest of the bits to a logical 1. If both operands are negative and the result is
positive, then a negative overflow has occurred, and the result in this case is
saturated by forcing the most-significant bit to a logical 1 and the rest of the bits to a
logical 0. Unsigned saturation is implemented in a similar way.
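The sign-bit test for signed saturation can be written out in Python for a single halfword. This is a behavioral sketch of the rule described above, with a hypothetical function name; operands and results are 16-bit patterns held in Python integers.

```python
def add16_signed_sat(a, b):
    """Add two 16-bit patterns with signed saturation via sign bits."""
    s = (a + b) & 0xFFFF
    sign_a, sign_b, sign_s = a >> 15, b >> 15, s >> 15
    if sign_a == 0 and sign_b == 0 and sign_s == 1:
        return 0x7FFF   # positive overflow: MSB forced to 0, rest to 1
    if sign_a == 1 and sign_b == 1 and sign_s == 0:
        return 0x8000   # negative overflow: MSB forced to 1, rest to 0
    return s            # no overflow: result passes unchanged


print(hex(add16_signed_sat(0x7FFF, 0x0001)))
print(hex(add16_signed_sat(0x8000, 0xFFFF)))
print(hex(add16_signed_sat(0x0001, 0xFFFF)))
```

Operands of opposite sign can never overflow, which is why only the two same-sign cases need to be tested.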

The average instruction, HAVE, requires manipulating the result after the addition
is finished. Before the implementation of the halfword instructions the ALU
selected between the results of a bitwise AND, a bitwise OR, a bitwise XOR, or the
sum of the two input operands. The halfword average instruction adds an
additional choice. The average result is the sum of the two input operands shifted
right one bit position with a carry out of the most-significant bit (MSB) becoming
the MSB of the result. To perform rounding of the result, the least-significant bit
(LSB) of the result is replaced by an OR of the two least-significant bits before
shifting right one bit.
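The averaging rule can be modeled in Python as follows. The sketch treats each halfword as an unsigned 16-bit value, which is an assumption made here, and the function name is hypothetical. The rounding step follows the description above: the low bit of the shifted sum is ORed with the bit that would otherwise be shifted out.

```python
def have(ra, rb):
    """Model of HAVE: parallel halfword average with rounding."""
    out = 0
    for shift in (16, 0):
        a = (ra >> shift) & 0xFFFF
        b = (rb >> shift) & 0xFFFF
        s = a + b                 # 17-bit sum; bit 16 is the carry out
        avg = (s >> 1) | (s & 1)  # carry becomes the MSB; OR the low bits
        out |= avg << shift
    return out


print(hex(have(0x00010003, 0x00020005)))  # averages of (1,2) and (3,5)
print(hex(have(0xFFFF0002, 0xFFFF0003)))  # no overflow even at the maximum
```

An exact half is rounded to the nearest odd value, so 1.5 becomes 1 and 2.5 becomes 3. Rounding down and rounding up then occur equally often, which is one way to obtain the unbiased behavior the instruction is described as having.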

The shift right and add and the shift left and add functions were added by modifying
the x-bus preshifter in the operand selection logic of the ALU. The original ALU was
capable of shifting 32-bit inputs left by zero, one, two, or three bits. To implement
the 16-bit shift left and add instructions, the left-shift circuits had to be broken at
the halfword boundary. This was done by ANDing the bits shifted from the
least-significant halfword to the most-significant halfword with a control signal that
indicates when a 32-bit shift is being done. The 16-bit shift right and add
instructions were implemented by adding the ability to shift one, two, or three bits right.
This shift is always broken at the halfword boundary.

Fig. 3. Flow of halfword instructions showing the location of the saturation logic in relation to
the ALU.

One challenging aspect of implementing the 16-bit shift left and add instructions
was detecting when shifting an operand left by one, two, or three bits causes a
positive or negative overflow. A positive overflow occurs when the unshifted
operand is positive and a logical one is shifted out of the left, or when
the result of the shift is negative. A negative overflow occurs when the unshifted
operand is negative and a logical zero bit is shifted out of the left, or when the
result of the shift is positive. These overflow conditions are combined with the
overflows calculated by the adder and used to saturate the final result. The final
result is saturated if either the left shift or the adder causes an overflow.

The result of selecting instructions that can provide the most useful functionality
while costing the least to implement was a relatively small increase in the area of
the ALU. About 15% of the ALU's area is devoted to halfword instructions. Since
the ALU's circuits were the only ones modified on the processor chip, only about
0.2% of the total processor's chip area is devoted to halfword instructions.



video clip that was compressed at 30 frames/s. The Model
715 and Model 735 are based on the PA 7100 processor. The
Model 712 is based on the PA 7100LC processor, which is a
derivative of the PA 7100. The PA 7100LC contains the
multimedia enhancements and system integration features and
is described in the article on page 12. The older, high-end
Model 735 running at 99 MHz achieves 18.7 frames/s while
the newer entry-level Model 712 achieves 26 frames/s at 60
MHz and 33.1 frames/s at 80 MHz. These frame decompression
rates are quoted for MPEG video only (no audio) with
no constraints on how fast the decoding can proceed. In
other words, the decoding rate is not constrained by the rate
at which the MPEG stream has been compressed. Hence,
although the video clip used was MPEG compressed at 30
frames/s, the 80-MHz Model 712 can decode it faster than 30
frames/s in unconstrained mode. This implies that there is
some processor bandwidth left after achieving real-time
software MPEG video decoding.

Fig. 6. Comparison between the performance of the enhanced
Berkeley MPEG decoder and the HP MPEG decoder (without audio).

In the video player product in HP MPower 2.0, frames are
skipped if the decoder cannot keep up with the desired
real-time rate. This results in a lower effective frame rate, since
skipped frames are not counted, even though execution time
may have been used for partial decoding of a skipped frame.

Fig. 6 shows a comparison between the enhanced Berkeley
software MPEG decoder and the HP software MPEG decoder
running on the older HP 9000 Model 720 (with no hardware
multimedia enhancements) and the newer Model 712
workstation (with hardware multimedia enhancements). The
fourth column in Fig. 6 illustrates the performance
obtainable with synergistic software and hardware enhancements.

In the Model 720, the Berkeley and HP software decoders
have comparable performance. For the Model 712, the
performance of the HP decoder was 2.4 times greater than the
Berkeley decoder because of the synergistic coupling of the
algorithms and software optimized with the PA-RISC
multimedia instructions and the system-level enhancements in
the Model 712.

Fig. 7 shows the performance when MPEG audio of various
fidelity levels is also decompressed by software running on
the general-purpose PA 7100LC processor. The
highest-fidelity audio is stereo with no decimation. This means that
every audio sample comes as a pair of left and right channel
values, and every sample is used. Half decimation means
that one out of every two audio samples is used. (3/4
decimation means that only one out of every four audio samples
is used.) Mono means that every audio sample is a single
value (channel) rather than a pair of values.

While software decompression of MPEG audio degrades the
performance in terms of frames decoded per second, the PA
7100LC-based workstations achieved rates of 15.1 frames/s
at 60 MHz, 21.2 frames/s at 80 MHz, and 27.1 frames/s at
100 MHz even with the highest-fidelity 44.1-kHz stereo 16-bit
linear audio format with no decimation. With further
enhancements of audio decoding and audio-video
synchronization, we should be able to do even better.

Fig. 7. Performance when MPEG video and MPEG audio are decoded
in software.

Conclusion 

We wanted a software approach to MPEG decoding because
we felt that if video is to be useful it has to be pervasive, and
to be pervasive, it should exist at the lowest incremental
cost on all platforms. With a software video decoder, there is
essentially no additional cost. In addition, the evolving
standards and improving algorithms pointed to a flexible
solution, like software running on a general-purpose processor.
Using special-purpose chips designed for MPEG decoding,
or even for JPEG, MPEG, and H.261 compression and
decompression, would not allow one to take advantage of
improved algorithms and adapt to evolving standards without
buying and installing new hardware.

Furthermore, since the performance of general-purpose
microprocessors continues to improve with each new
generation, we wanted to be able to leverage these
improvements for multimedia computations such as video
decompression. This approach also allows us to focus hardware
design efforts on improving the performance of the
general-purpose processor and system without having to replicate
performance efforts in each special-purpose subsystem,
such as the graphics and video subsystems. The PA-RISC
multimedia instructions are also useful for graphics, image,
and audio computations, or any computations requiring
arithmetic on a lot of numbers with precision less than 16
bits.

The net result is that we achieve real-time MPEG decoding 
of video streams at 30 frames/s with a software decoder.
This was achieved by a synergistic combination of algorithm 
enhancements, software tuning, PA-RISC processor multi- 
media enhancements, combining video and graphics support 
for color conversions and color compression, and system 
tuning. The PA-RISC multimedia enhancements allow paral- 
lel processing of pixels in the standard integer data path at 
an insignificant addition to the silicon area. The total area 
used is less than 0.2% of the PA 7100LC processor chip with 
no impact on the cycle time or the control complexity. 

The real-time software MPEG decoding rate of the final
video player product exceeds our original goal of 10 to 15
frames/s for a software-based MPEG video decoder. It is
also significant that MPEG video decoding at 30 frames/s is
achieved by an entry-level rather than a high-end
workstation. This is in the context of a full-function video player
on the HP MPower 2.0 product. With MPEG audio decoding
(also done by software), the frame rate is usually above 15
frames/s, even for the low-end Model 712/60 workstation,
and around 24 frames/s for the Model 712/80 workstation.

We expect to see continuous improvement in the MPEG
decoding rate as the performance of the general-purpose
processors increases. With PA-RISC processors, there has
been roughly a doubling of performance every 18 to 24
months. This would imply that larger frame sizes, multiple
video streams, or MPEG-2 streams may be decoded in the
future by such multimedia-enhanced general-purpose
processors.

HP TeleShare: Integrating Telephone
Capabilities on a Computer
Workstation



Using off-the-shelf parts and a special interface ASIC, an I/O card was
developed that provides voice, fax, and data transfer via a telephone line
for the HP 9000 Model 712 workstation.

by S. Paul Tucker 



Integration of the telephone and the computer workstation
is a natural step in the evolution of the electronic office. It
allows the user to perform telephone transactions without
having to change from the keyboard and mouse
environment to the telephone and handset environment and vice
versa. This capability provides obvious benefits to a wide
customer audience, especially those dealing with customer
service and support. The HP TeleShare option card for the
HP 9000 Model 712 workstation represents HP's first
integrated telephony product. Coupled with multimedia
technologies such as audio, video, and HP SharedX,1 HP TeleShare
provides the user with a powerful arsenal of
communications tools. This article will focus mainly on the hardware
aspects of the HP TeleShare product.

Features provided by HP TeleShare include:

• Two-line support, with each line configurable for voice, fax,
  or data
• Workstation audio support and mixing (stereo headset with
  built-in microphone included)
• Dual-tone multifrequency (DTMF) tone generation and
  detection
• Telephone line status and control
• Call progress support
• Caller-ID support
• V.32bis modem (14,400 bits/s) with V.42bis and MNP5
  (Microcom Networking Protocol) compression and error
  correction
• Fax Group 3 Class II up to 14,400 bits/s.

Background

HP TeleShare began as an experimental interface card for
the HP 9000 Series 300 workstations. It had simple
voice-only telephone capabilities, including single-source audio
record and playback, and it was perceived as useful (and
entertaining) to those engineers who were fortunate enough
to have the opportunity to use it. At some point, further
investigation was needed and the HP TeleShare project team
was formed. It was determined that fax and data modem
capabilities were needed with close coupling to the
workstation's audio capability. Dual telephone lines were included
so the user could talk on one line and at the same time use
the other line for faxing or data. The standard analog phone
line interface was chosen over digital (i.e., ISDN) because of
the relatively insignificant number of digitally equipped PBX
systems.

The first incarnation of the current product was an external
RS-232-driven box with stereo inputs for computer, line-in,
and microphone audio, and stereo outputs for computer,
line-out, and headphones. It provided dual-line operation
and employed two DSP (digital signal processor)
subsystems for maximum flexibility and performance in voice and
data modes. Audio mixing was provided by dedicated analog
hardware, and any combination of audio inputs could be
sent to any output complete with treble and bass control.
The audio capabilities were so good that HP TeleShare
engineers always had their CD players plugged into the box and
their headphones on. This forced development of an
automatic audio mute feature when an incoming call was
detected. Since modem functionality was a primary goal, the
command interface for the box used a partial modem AT
command set, along with some proprietary extensions for
new functionality like setting audio gains, setting audio mix
values, telling a DSP to reboot, and so on. These commands
were delivered over the RS-232 interface and received the
typical OK and error responses.

While the external box was well received in the lab and by
customers, it was postponed indefinitely in favor of a
lower-cost internal version with a proprietary interface available on
a single workstation, the Model 712. The same DSP
subsystem used in the external box was carried over to the Model
712 option card and some effort was exerted to leverage as
much as possible of the external box's software interface
and feature set into the new design.

Architecture 

HP TeleShare is made up of two independent DSP
subsystems that communicate with the workstation host through
an interface chip called XBAR (see Fig. 1). Each DSP is
coupled to a hybrid chip called a data access arrangement,
which provides direct connection to a standard two-wire
analog telephone line. HP TeleShare is tightly coupled with
the workstation audio system to provide the highest degree
of audio flexibility. For instance, line-in audio (perhaps from
a CD player) could be sent to telephone line 0 while the
party on that line is on hold. Simultaneously, the workstation
user could be conversing with or faxing a message to the
party on telephone line 1. During this time, comments from
the party on line 0 could be recorded to disk for later
playback.

XBAR. The XBAR ASIC (application-specific integrated
circuit) is a custom VLSI part packaged in a low-cost 80-pin
QFP (quad flat pack). It was designed by the HP TeleShare
team specifically for use with the Model 712 workstation
and performs all of the interface functions required by the
HP TeleShare card. The XBAR chip communicates with
the system I/O chip (LASI) and the audio CODEC
(coder/decoder) through a pair of proprietary serial interfaces. If
HP TeleShare is not present in the system, audio data and
CODEC control words pass through the bidirectional buffer
between the system I/O chip and the audio CODEC. When
the HP TeleShare card is installed, XBAR is effectively
placed between the system I/O chip and the CODEC, forcing
all audio to be routed through XBAR in either direction. The
serial interface between XBAR and the audio CODEC carries
16-bit stereo audio data.

The serial interface between XBAR and the I/O chip
multiplexes 16-bit system audio data (to and from disk) and
control words for XBAR. In addition, this interface is used for
modem data, voice-mode AT commands and responses, and
DSP application code downloaded from the host system. On
the DSP side, XBAR has two 13.821-MHz serial ports, each
designed specifically for interfacing with the DSPs. These
ports are used for passing audio samples, modem data,
application programs, and commands and responses to and
from the DSPs.

XBAR's configuration can be changed by writing to the
control registers in the XBAR-to-LASI I/O serial interface
address space. In voice mode, XBAR is configured to pass
each audio data sample from the LASI I/O chip and CODEC
to a DSP, whereupon the DSP will return
responses to each of those sources. XBAR can also send
audio data samples from DSP to DSP for conferencing
between lines when both lines are in voice mode. In data and
fax modes, XBAR sends appropriately formatted data to the
DSP and receives similar data in return.

Although XBAR supports stereo audio at up to a 48-kHz
sample rate, DSP bandwidth limitations require all audio
data to and from the telephone lines to be left-channel only,
sampled at 8 kHz. This is not a serious limitation, since
telephone-quality audio only requires a sample rate around 7.2
kHz for full reproduction and is inherently a single-channel
signal.

In addition to the DSP serial ports, XBAR also has a pair of
byte-wide parallel ports that connect to the DSPs' boot ROM
ports. This allows DSP boot code (not to be confused with
DSP application code) to be downloaded from the host
system. This provides additional flexibility and eliminates the
cost and board-space limitations associated with external
ROMs.

XBAR has several asynchronous control signals that are 
connected to downstream hardware, including reset lines 
for the DSP subsystems and hook control and telephone 
status (such as the ring indicator) signals to and from the 
data access arrangement chips. 

The biggest challenge in XBAR's design was purely logistical
because we had a lot of different data types (e.g., stereo data
and telephone command and status data) to handle and very
little time to implement them in XBAR. There are no fewer
than 52 separate data types that XBAR must recognize and
generate for the two DSP serial interfaces alone, with a
slightly smaller number required for the system I/O interface.
To provide these types, each transaction between XBAR and
a DSP consists of 16 bits, with the upper eight bits providing
type information and the lower eight bits providing the
associated data. To prevent data overruns, XBAR requires a data
acknowledge (Ack) word back from the appropriate DSP for
every transaction.

Audio data samples in HP TeleShare are 16 bits long. Since
XBAR sends eight bits of data at a time, audio samples must
be broken into two pieces: an upper half, or most-significant
byte (MSB), and a lower half, or least-significant byte (LSB).
Using this model, it requires two transfers from XBAR and
two Acks back to send one sample of audio data to a DSP.
Sending one set of stereo system and line-in audio samples
to a DSP requires eight output transfers (four transfers for
the system sample and four transfers for the line-in sample),
with an Ack back after each transfer. The DSP will then send
mixed audio samples back for the system and the CODEC,
requiring an additional eight transfers, for a total of 16
transfers per sample. This has to happen at an 8-kHz sample rate
(once every 125 microseconds). Fortunately, XBAR can
handle these transactions, but order must be maintained exactly
or audio quality will suffer. Other data types, such as AT
commands and responses, are given lower priority during
audio frames and are queued until audio transfer is finished.
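
The word format just described (type in the upper byte, data in the lower byte) can be illustrated with a short Python sketch. The type codes below are made-up placeholders, since the article does not list the actual values, and the helper names are hypothetical.

```python
FRAME_RATE_HZ = 8000                      # one audio frame per sample period

# Hypothetical type codes; the real XBAR values are not given in the text.
TYPE_SYS_LEFT_MSB = 0x10
TYPE_SYS_LEFT_LSB = 0x11


def pack_word(type_code, data_byte):
    """Build one 16-bit transaction: type in bits 15..8, data in bits 7..0."""
    return ((type_code & 0xFF) << 8) | (data_byte & 0xFF)


def unpack_word(word):
    """Return (type_code, data_byte) from a 16-bit transaction."""
    return (word >> 8) & 0xFF, word & 0xFF


def split_sample(sample, msb_type, lsb_type):
    """Split a 16-bit audio sample into an MSB word and an LSB word."""
    sample &= 0xFFFF
    return [pack_word(msb_type, sample >> 8),
            pack_word(lsb_type, sample & 0xFF)]


def join_sample(msb_word, lsb_word):
    """Rebuild the 16-bit sample from the data bytes of the two words."""
    return ((msb_word & 0xFF) << 8) | (lsb_word & 0xFF)


def transfers_per_direction(sources=2, channels=2, bytes_per_sample=2):
    """Transfers needed to move one stereo sample from each source."""
    return sources * channels * bytes_per_sample


frame_period_us = 1_000_000 / FRAME_RATE_HZ   # time budget for one frame
```

With two stereo sources (system and line-in), `transfers_per_direction()` gives the eight transfers mentioned above, and `frame_period_us` is the 125-microsecond window in which they, the returned samples, and all of the Acks must fit.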

Digital Signal Processor. The DSP used by TeleShare is an
Analog Devices ADSP2101. This is a programmable
single-chip microcomputer optimized for digital signal processing,
and operates at 16.67 MHz. The 2101 operates on 16-bit data
and uses a 24-bit instruction word. It has 1024 words of data
RAM and 2048 words of program RAM on the chip. The part
has two data address generators and a program sequencer,
which allows program and data accesses to occur
simultaneously in a single cycle. Dual data operand fetches can also
occur in a single cycle since program memory can also be
used to store data. The part can address up to 16K words
of data and 16K words of program memory, both of which
are supplied on the HP TeleShare board in the form of six
external SRAMs.

The DSP has two independent serial ports, SPORT0 and
SPORT1, which support multiple data formats and frame rates
and are fully programmable. In the HP TeleShare design,
SPORT1 on each DSP is dedicated to communication with
XBAR, while SPORT0 is dedicated to communication with an
AD28msp01 telephone line CODEC. Each full transfer to or
from one of the SPORT lines triggers an associated interrupt
in the 2101, allowing programs to act on the incoming data
as it arrives.

Analog Devices supplies a complete set of software develop- 
ment tools for the 2100 microprocessor family, including a C 
compiler with a DSP function library. 

CODEC. The Analog Devices AD28msp01 provides HP
TeleShare with a multiple-sample-rate CODEC specifically
designed for use in modem designs. This device supports
sample rates of 7.2 kHz, 8.0 kHz, and 9.6 kHz, and has an 8/7
mode* for sampling at 8.23 kHz, 9.14 kHz, and 10.97 kHz. For
voice mode operation, 7.2 kHz and 8.0 kHz are all that is
required, but the sophisticated algorithms used by modern
modem standards often require the other rates. The CODEC
uses 16-bit sigma-delta conversion technology and includes
resampling and interpolation filtering along with transmit
and receive phase adjustments.

Each CODEC has one serial port which is connected directly
to SPORT0 on the associated DSP. This port operates in
free-running mode once it is properly initialized and continually
sends 16-bit data samples from the telephone line to the
DSP. All transfers to the DSP consist of a serial output frame
sync followed by a 16-bit address word, then a second frame
sync followed by a 16-bit data word. These address and data
pairs are transmitted at the selected sample rate and trigger
SPORT0 receive interrupts in the DSP. The DSP transfers data
to the CODEC using the same mechanism as just described
(in the other direction, of course). The address portion of
each transfer coming from the DSP identifies the data as a
control word (for programming the part) or as a data word
to be sent through the on-chip digital-to-analog converter
(DAC) and transmitted to the data access arrangement chip.
Data from the CODEC to the DSP is identified as either a
control word or as a data word from the on-chip
analog-to-digital converter (ADC).

The AD28msp01 CODEC is attached to the telephone line
through the data access arrangement chip. Transmit data
outputs are differential for noise reduction, while the
receive data input is single-ended.

Data Access Arrangement. HP TeleShare uses the TDK
73M9002 data access arrangement chip as its telephone line
interface. This part provides all the necessary line monitoring,
filtering, isolation, protection, and signal conversion functions
for connection of high-performance analog modem designs to
the PSTN (public switched telephone network) in the United
States, Canada, and Japan. The 73M9002 incorporates, on a
two-to-four-wire hybrid, ring detection circuitry, off-hook
relay, and on-hook line monitoring for caller-ID support (see
"Caller-ID" on page 72).

The 73M9002 comes with FCC (Federal Communications
Commission) Part 68, DOC CS-03, and JATE (Japan Approvals
Institute for Telecommunications Equipment) protection
circuitry built in, and is compliant with UL 1459 2nd Edition

* The 8/7 mode is a capability required by some modem applications. It simply adds some
sampling bandwidth. For example, in 8/7 mode the normal 8-kHz sample rate becomes 9.14
kHz (8 × 8/7).

Caller-ID

Caller-ID information is sent between the first and second power ringing signals.
The data is sent a minimum of 500 milliseconds after the first ring and ends at
least 200 milliseconds before the second ring begins. This leaves 2.9 to 3.7 seconds
of time for data transmission. The data is sent at 1200 baud using frequency shift
keying (FSK) modulation. All data is 8-bit ASCII.

Two standard formats exist for Caller-ID information: single message format and
multiple message format. In general, both formats can be described using Fig. 1.

The message type is 0x4 (hexadecimal 4) for single message format. The message
length is variable and indicates the number of message words in the message
body. The final word is a checksum word, used for error checking. Single message
format provides the receiver with date, time, and calling number data.

The message type is 0x80 (hexadecimal 80) for multiple message format. The
message length is variable as before, but provides the receiver with date, time,
calling number, and calling name data if available. In the absence of calling name
data, a P indicating private or an O indicating out of area or unavailable will be sent.
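
As a rough illustration of the type/length/body/checksum framing described above, the following Python sketch builds and validates a frame. The function names are hypothetical. The checksum rule used here (the two's complement of the modulo-256 sum of all preceding bytes, so that every byte of a valid frame sums to zero modulo 256) comes from the Bellcore caller-ID specification rather than from this article, so treat it as an assumption; the example body layout (MMDDHHMM followed by the number) is likewise only illustrative.

```python
SINGLE_MESSAGE_TYPE = 0x04
MULTIPLE_MESSAGE_TYPE = 0x80


def checksum(data):
    """Two's complement of the modulo-256 sum (assumed rule, see above)."""
    return (-sum(data)) & 0xFF


def build_message(msg_type, body):
    """Frame a body as: type byte, length byte, body bytes, checksum byte."""
    header = bytes([msg_type, len(body)])
    return header + body + bytes([checksum(header + body)])


def parse_message(frame):
    """Validate a frame and return (message type, body bytes)."""
    if len(frame) < 3:
        raise ValueError("frame too short")
    msg_type, length = frame[0], frame[1]
    if len(frame) != length + 3:
        raise ValueError("length byte does not match frame size")
    if sum(frame) & 0xFF != 0:
        raise ValueError("checksum mismatch")
    return msg_type, frame[2:2 + length]


# Illustrative single-format body: date and time, then the calling number.
example = build_message(SINGLE_MESSAGE_TYPE, b"04011230" + b"5551234")
```

Calling `parse_message(example)` returns the type `0x04` and the original body; flipping any bit in the frame makes it raise `ValueError`.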

Caller-ID detection requires on-hook line monitoring, which the HP TeleShare data
access arrangement chip fully supports. HP TeleShare can detect and display both
message formats.

Fig. 1. Caller-ID message format.



with the addition of an external slow-blow fuse. The
off-hook relay is controlled by a TTL-level input from XBAR.
This relay determines when the phone is off the hook and
can be pulsed for use as a pulse dialer. Another TTL-level
input from XBAR is used to enable on-hook line monitoring.

The ring detection circuitry is capable of detecting ringing 
signals that comply with Ringing Type B from the FCC Part 
68 regulations. The detected ring signal appears at a pair of 
differential outputs which are also connected to XBAR. 

The 73M9002 provides telephone connectivity to the DSP
subsystem through the CODEC's analog receive and
transmit lines and is attached to the telephone line through a
standard RJ-14 connector.

Operating Modes 

HP TeleShare is capable of operating in three modes: voice,
fax, and data. The modes are selected through a graphical
user interface by the workstation user. The mode
application software is downloaded through XBAR to a DSP as
needed and runs continually until a reset of that DSP is
performed. The voice mode code was developed solely by HP
and can be run on one or both lines simultaneously. The fax
and data modem code was developed with a third party and
because of licensing restrictions, only one line can be
configured as a fax or data modem at a time. Combinations of
voice and fax or data are fully supported.



Voice Mode Operation. When configured in the voice mode,
HP TeleShare essentially operates like an enhanced
telephone. Digital mixing of microphone, line-in, telephone, and
recorded audio (from system disk) is supported for both
playback and recording. This capability allows numerous
interesting audio configurations including placing a line on
hold with music, recording conversations, playing back
recorded audio over the phone, and so on. While in voice
mode, HP TeleShare provides the user with caller-ID
information if it is available. In addition, DTMF (dual-tone
multifrequency) tone and pulse dialing are supported, along with
DTMF tone detection for unattended phone functions like
answering machines or voicemail (see "Call Progress, DTMF
Tones, and Tone Detection" on page 73).

Dialing and hook manipulation actions are performed through
the GUI (graphical user interface), but at the lowest level
these actions are sent to the DSP as standard AT commands
like ATDT (attention dial tone) and ATH n (n = 0 is on-hook or
hang up, and n = 1 is take telephone off hook). Special
functions like audio mixing are also controlled with low-level
AT-type commands, but are manipulated using sliders in the
GUI.

The voice mode application firmware is driven primarily by
DSP SPORT interrupts. Every incoming 16-bit SPORT1 word
from XBAR triggers an interrupt, which in turn causes the
SPORT1 interrupt service routine to execute. Likewise, every
16-bit SPORT0 word from the CODEC causes the SPORT0
interrupt service routine to execute. The SPORT1 interrupt service
routine is responsible for audio I/O with XBAR and queueing
AT commands as they arrive. Commands arrive
asynchronously, that is, they can arrive at any time, while audio
arrives in 8-piece bundles every 125 microseconds (one frame)
as described earlier. Normally, every piece of data received
by SPORT1 causes an interrupt, but the firmware disables
these interrupts for the rest of a frame once it recognizes the
first piece of audio data. Otherwise, at least eight context
switches would occur every frame, which would render the
system useless. Once the SPORT1 interrupt service routine
has received all of the audio samples, it is responsible for
transmitting the new audio back to XBAR for routing to the
workstation (i.e., headphones and/or disk).

The SPORT0 interrupt service routine is responsible for
receiving and transmitting telephone-line audio and mixing all audio
data, including DTMF tones. Before mixing can occur in the
DSP, all of the LSBs must be appended to the MSBs.
Remember that each 16-bit sample transferred between XBAR and
the DSP is divided so that the most-significant byte contains
the data type and the least-significant byte contains the data.
Thus, all the data from XBAR is put back into 16-bit linear
format before transfer to the CODEC.

The audio input and output amplitude matrices, built by the
user via the GUI, are used to determine what the final mix
will sound like. The DSP firmware processes each output in
sequence by adding together any inputs that are on to create
a total value for each output. Any gain adjustments are made
at this time as well. When this is completed for all outputs,
the resulting 16-bit values are broken into MSBs and LSBs, if
required.
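
A minimal Python sketch of this kind of matrix mix is shown below. It is an illustration only: the input and output names are invented, the per-input gain representation is an assumption about how the amplitude matrices might be organized, and clamping to the signed 16-bit range is added here because the article does not say how the firmware handles sums that exceed 16 bits.

```python
def clamp16(value):
    """Limit a value to the signed 16-bit range (assumed behavior)."""
    return max(-32768, min(32767, value))


def mix(inputs, matrix):
    """Mix one sample period.

    inputs: {input name: signed 16-bit sample}
    matrix: {output name: {input name: gain}}; an input that is absent
            from an output's row is treated as switched off.
    """
    outputs = {}
    for out_name, row in matrix.items():
        total = 0.0
        for in_name, gain in row.items():
            total += gain * inputs[in_name]
        outputs[out_name] = clamp16(int(round(total)))
    return outputs


# Hypothetical routing: the caller hears line-in music while on hold,
# and the headset gets the telephone audio plus quiet disk playback.
example_inputs = {"mic": 1000, "line_in": 2000, "phone": -500, "disk": 30000}
example_matrix = {
    "phone_line": {"mic": 1.0, "line_in": 0.5},
    "headphones": {"phone": 1.0, "disk": 0.5},
    "record": {"disk": 1.0, "line_in": 2.0},
}
example_outputs = mix(example_inputs, example_matrix)
```

In this example the `record` output would sum to 34000 and is clamped to 32767, which is the sort of case where a gain adjustment in the matrix would normally be reduced.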

Audio data that is meant for XBAR is transmitted during the
next XBAR audio frame. Audio data meant for output to the

Call Progress, DTMF Tones, and Tone Detection



HP TeleShare's voice mode firmware has the ability to detect a number of tones
used commonly in telephone communications, including DTMF tones and call
progress tones like busy, ringback (the ringing sound you hear when you call
someone), and dial tone.

DTMF Tones 

Dual-tone multifrequency (DTMF) tones are made up of two separate tones, as the
name suggests, and can be accurately generated using easily understood principles.
The DTMF standard specifies two sets of distinct tones, called row frequencies
and column frequencies (see Fig. 1). The row frequencies correspond to the
horizontal rows on a standard telephone touchpad. The column frequencies correspond
to the vertical columns on the touchpad, plus an additional column to the right of
the last touchpad column.

This makes eight separate frequencies, which combine for a total of sixteen DTMF
tones (see Fig. 2).

Generation of a DTMF tone is accomplished by creating a sinusoid for each of the
two frequencies, row and column, and then adding the results. In a digital
implementation, the sinusoids are computed and added on a sample-by-sample basis.
HP TeleShare uses a five-coefficient Taylor series approximation for the sinusoid
generation. The sinusoid samples are updated and added at 8 kHz, or every 125
microseconds, and the sum of the sinusoid samples is used as the current DTMF
sample.
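
The generation scheme can be sketched in Python as follows. This is not the HP TeleShare firmware: the function names, the output amplitude, and the range-reduction step before the five-term series are assumptions made for the sketch. The row and column frequencies are the standard DTMF values.

```python
import math

SAMPLE_RATE_HZ = 8000
ROW_HZ = (697, 770, 852, 941)
COLUMN_HZ = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def taylor_sin(x):
    """Five-term Taylor approximation of sin(x).

    The angle is first folded into [-pi/2, pi/2] (an assumption of this
    sketch) so that five terms stay accurate.
    """
    x = math.fmod(x, 2 * math.pi)
    if x > math.pi:
        x -= 2 * math.pi
    elif x < -math.pi:
        x += 2 * math.pi
    if x > math.pi / 2:
        x = math.pi - x
    elif x < -math.pi / 2:
        x = -math.pi - x
    x2 = x * x
    # x - x^3/3! + x^5/5! - x^7/7! + x^9/9! in nested form
    return x * (1 - x2 / 6 * (1 - x2 / 20 * (1 - x2 / 42 * (1 - x2 / 72))))


def dtmf_frequencies(key):
    """Return (row frequency, column frequency) for one keypad symbol."""
    if len(key) == 1:
        for r, row in enumerate(KEYPAD):
            c = row.find(key)
            if c >= 0:
                return ROW_HZ[r], COLUMN_HZ[c]
    raise ValueError("not a DTMF key: %r" % (key,))


def dtmf_samples(key, count, amplitude=8000):
    """Generate count samples of the tone for key at 8 kHz."""
    f_row, f_col = dtmf_frequencies(key)
    samples = []
    for n in range(count):
        t = n / SAMPLE_RATE_HZ            # time elapsed since the tone began
        s = (taylor_sin(2 * math.pi * f_row * t)
             + taylor_sin(2 * math.pi * f_col * t))
        samples.append(int(amplitude * s))
    return samples
```

For instance, `dtmf_samples("5", 160)` produces 20 milliseconds of the 770-Hz plus 1336-Hz pair; each per-tone amplitude of 8000 keeps the sum well inside the 16-bit range.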

Tone Detection 

Tone detection is accomplished through the use of a 512-point fast Fourier transform
(FFT), which is implemented in the ADSP2101 C-language run-time library. The FFT,
when given a set of samples of an input signal over some time interval, returns the
frequency spectrum of the signal during the interval. This can be done in almost
real time with a DSP, making it very useful for detecting incoming tones. The
following important rules and relationships should be noted concerning sample
rate, input points, output points, time, frequency, and the FFT in general:

• The FFT requires complex (real and imaginary) data for input (two arrays).
• The imaginary input array may be filled with zeros if unused.
• The output data is complex (two arrays).
• The frequency spectrum returned covers half of the sampling frequency.
• Only the first half of output data is used, and the other half is a mirror image.
• The output frequency resolution is equal to (sampling rate)/(number of input points).

Using an 8-kHz sampling rate and 512 points causes the FFT to return a spectrum
from 0 to 4 kHz, with 512 complex output points. The second 256 output points
can be ignored since they are the mirror image of the first 256. The output will
have a resolution of 15.625 Hz per point, using the formula above. These output
points will be referred to as bins since they include spectral data on either side of
each point.

HP TeleShare calculates magnitude-squared values for each bin by squaring the
real and imaginary values at each point and adding them. The magnitude-squared
values correspond roughly to the power of the signal in each bin. Once the powers
are known for each bin in the spectrum, they can be analyzed to see if any DTMF
or call progress tones are present.

Fig. 2. Call progress and DTMF tones recognized by HP TeleShare.

As an example, suppose the telephone has been taken off-hook in preparation for
dialing and HP TeleShare is configured to check for dial tone. 512 samples of the
input signal would be stored in the real input array, while the imaginary array is
filled with zeros. Next, the FFT function is called, returning the real and imaginary
arrays. The magnitude-squared values of the first 256 bins are computed using the
two output arrays. The two frequencies that make up a dial tone are 350 and 440 Hz
(see Fig. 2), so FFT indexes (or bin numbers) must be computed for these frequencies:



350/15.625 = 22.4

440/15.625 = 28.16



Fig. 1. Dual-tone multifrequency digits and the frequencies associated with them.



An effective method of checking for the existence of a particular frequency is to
compare the power present at that frequency with the total power of the spectrum.
This is done quite easily with magnitude-squared values since they represent
power in each bin already. Total power is simply the sum of all the
magnitude-squared values for the first 256 FFT return values. Divide this into the power of the
frequency being checked for, and the result is the percentage of total power for
that frequency. For example, when checking for 350 Hz, compute the sum of the
power values for bins 22 and 23 since the real index (22.4) falls between them,
and then divide by the total power. The result is the percentage of the total power
present around 350 Hz. The same can be done for 440 Hz, using bins 28 and 29.

Once the percentage of total power is calculated, a comparison can be made to see
if the power in each frequency meets match criteria. The HP TeleShare firmware
typically uses 35% of total power as a match condition. In other words, if the power
present at the desired frequencies is 35% or more of the total power, dial tone has
been detected. Otherwise, no dial tone is found.
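
The dial tone check described above can be sketched in a few lines of Python.
This is only an illustration, not the HP TeleShare firmware. The function names,
the pure-Python radix-2 FFT, and the choice to add the 350-Hz and 440-Hz
fractions before comparing against the threshold are all assumptions made for
the example. The 8-kHz sample rate, the 512-point FFT, the bin pairs, and the
35% match condition come from the text.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000                      # sample rate stated in the article
FFT_POINTS = 512                           # FFT length stated in the article
HZ_PER_BIN = SAMPLE_RATE_HZ / FFT_POINTS   # 15.625 Hz per bin
MATCH_FRACTION = 0.35                      # "35% of total power" match condition


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def bin_powers(samples):
    """Magnitude-squared value of each bin in the first half of the spectrum."""
    # Real input array holds the samples; the imaginary array is all zeros.
    spectrum = fft([complex(s, 0.0) for s in samples])
    half = spectrum[: len(samples) // 2]   # second half is the mirror image
    return [z.real * z.real + z.imag * z.imag for z in half]


def power_fraction(powers, freq_hz):
    """Fraction of total power in the two bins that straddle freq_hz."""
    low = int(freq_hz / HZ_PER_BIN)        # 350 Hz -> bin 22, 440 Hz -> bin 28
    total = sum(powers)
    if total == 0.0:
        return 0.0                         # silence: nothing to match
    return (powers[low] + powers[low + 1]) / total


def dial_tone_present(samples):
    """Assumed match rule: combined 350/440-Hz power is at least 35% of total."""
    powers = bin_powers(samples)
    combined = power_fraction(powers, 350.0) + power_fraction(powers, 440.0)
    return combined >= MATCH_FRACTION
```

Calling dial_tone_present() on 512 samples of a 350-Hz plus 440-Hz signal
should report a match, while a single tone elsewhere in the band should not.
A stricter variant could require each frequency to meet its own threshold.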

The number of bins used in the comparison and the match criteria can be fine-
tuned for a particular application. The match criteria can include other tests and
can be relaxed or tightened as needed. The number of bins used can be influenced
by the total number of points in the FFT and by a preprocessing tool that does
windowing. Windowing is used to create a finite-length sequence from a continuous
sequence. It is basically a digital filter that truncates an infinite-length input
sequence while preserving its frequency characteristics. Since we are grabbing
little pieces (sequences) of data, we need to window the data.
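
As a rough illustration of the windowing step, the sketch below tapers a block
of samples before it is handed to the FFT. The text does not say which window
HP TeleShare uses, so the Hann window here is an assumption, as are the
function names.

```python
import math


def hann_window(length):
    """Return Hann window coefficients (an assumed window choice)."""
    if length < 2:
        raise ValueError("window length must be at least 2")
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(samples, window):
    """Multiply each sample by the matching window coefficient."""
    if len(samples) != len(window):
        raise ValueError("samples and window must have the same length")
    return [s * w for s, w in zip(samples, window)]
```

Tapering the ends of each 512-sample block reduces the power that leaks from a
tone into neighboring bins, which is one reason the number of bins summed per
frequency can depend on the window.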
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telephone line is immediately sent to the CODEC without 
being split in half. Since both interrupt service routines run at 
8 kHz, there is no need to worry about sample rate changes.
DTMF audio data is only available for mixing when a tone is
being generated. A new DTMF sample is generated during
every SPORT0 interrupt and is based on the sample rate
(always 8 kHz) and the time elapsed since the tone began.
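
A minimal sketch of per-interrupt DTMF sample generation follows. It assumes
the standard DTMF row and column frequencies and a simple sum of two sine
waves. The function names, the amplitude scaling, and the use of a sample
counter to represent elapsed time are illustrative assumptions rather than
details of the HP TeleShare DSP code.

```python
import math

SAMPLE_RATE_HZ = 8000                      # always 8 kHz, per the text
ROW_HZ = (697, 770, 852, 941)              # standard DTMF row frequencies
COLUMN_HZ = (1209, 1336, 1477, 1633)       # standard DTMF column frequencies
KEYPAD = ("123A", "456B", "789C", "*0#D")


def dtmf_frequencies(digit):
    """Return the (row, column) frequency pair for one keypad digit."""
    if len(digit) == 1:
        for row_index, row in enumerate(KEYPAD):
            column_index = row.find(digit)
            if column_index >= 0:
                return ROW_HZ[row_index], COLUMN_HZ[column_index]
    raise ValueError("not a DTMF digit: %r" % (digit,))


def dtmf_sample(digit, samples_since_start, amplitude=0.5):
    """Compute one output sample from the time elapsed since the tone began."""
    row_hz, column_hz = dtmf_frequencies(digit)
    elapsed_s = samples_since_start / SAMPLE_RATE_HZ
    return amplitude * (math.sin(2.0 * math.pi * row_hz * elapsed_s)
                        + math.sin(2.0 * math.pi * column_hz * elapsed_s))
```

Each call corresponds to one interrupt: the caller increments the sample count
and mixes the returned value into the outgoing audio while the tone is active.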

All of these interrupts and audio manipulations require almost
all of the DSP's processing bandwidth and can affect
some areas of system performance. Because of DSP bandwidth
limitations, DTMF detection can have a slight but
noticeable effect on the audio quality heard by the user.
However, in unattended modes like answering machines or
voicemail (where DTMF detection could be used for such
things as navigation), this should not be a concern. The
default configuration has DTMF detection disabled, since the
typical user will never use it, and the current GUI does not
support it.

Fax and Data Modem Operation. The fax and data modem
functionality was codeveloped by HP and Digicom Systems
Incorporated and uses their SoftModem technology. The fax
mode allows transfers up to 14,400 bits/s and covers Group
3 Class 2 and all fallbacks. Data mode supports transfers up
to 14,400 bits/s (V.32bis) and can reach peak rates of 57,600
bits/s with compression.

Conclusion 

HP TeleShare effectively combines telephone communications
capability with a low-cost computer workstation. Context
switches between the display and the telephone are
minimized by integrating the telephone into the computer
system and providing an easy-to-use graphical user interface.
Voice, fax, and high-speed data modes are supported
using flexible digital signal processing technology.
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Product Design of the Model 712 
Workstation and External Peripherals 

A product design without fasteners and the use of environmentally 
friendly materials and low-cost parts with integrated functions provides 
excellent manufacturability, customer ease of use, and product
stewardship. 



by Arlen L. Roesner 



The HP 9000 Model 712 workstation and the three new
peripherals that go with the product are an excellent example
of computer integration and simplicity. The new workstation,
while providing a new class of performance with
HP's new PA-RISC PA 7100LC processor, pushed the envelope
of product design by using relatively few and inexpensive
parts. In addition to simplicity and low cost, the product
promotes good product stewardship by making parts easy to
identify and recycle. Customers find the hardware easy to
manage because there are no fasteners to deal with, and all
the components snap or drop into place. The main workstation
product is a small compact size that fits easily under
a monitor or stands vertically on the desk, and the external
peripherals can be positioned on the desktop where they are
most convenient to the user. Fig. 1 shows the Model 712
workstation and its three peripherals.




Fig. 1. HP 9000 Model 712 workstation and related external
peripherals.



Outward Simplicity

Several assemblies of the Model 712 workstation products
have high levels of functional integration. This functional
integration tends to make components more complex, but
yields an outer simplicity by reducing the number of physical
parts and the methods necessary to work with them.
Once configured, the only accessible components of the
Model 712 workstation include the chassis, system board,
option boards (including memory), disk drive, flexible disk
drive, and top cover. All of these components are accessed
through quick removal of the cover and the manipulation of
a few snap or drop-in fits, which require a minimum of time
and effort. Fig. 2 shows the workstation and one of the
peripherals with their covers removed. Benefits of this resulting
simplicity include better manufacturability, easier customer
use and configuration, and serviceability.




Fig. 2. The Model 712 workstation and hard disk peripheral with top
covers disassembled.

© Copr. 1949-1998 Hewlett-Packard Co.

April 1995 Hewlett-Packard Journal 75

Fig. 3. Top view of the Model 712 without top cover.

Electronics

The system electronics is the place where integration is most
likely to be first noticed in the Model 712 product. Electronic
assemblies consist of one main system board, a power supply,
three optional circuit boards, and up to four memory
SIMMs. The main system board is relatively small, and all of
the core electronics is incorporated onto this board through
integration of functionality into relatively few VLSI components.
(Fig. 3 in the article on page 6 shows the main system
board.) The main system board uses dual-sided surface
mount construction, with I/O connector space being provided
mostly by double-high (stacked) bulkhead connectors.
Optional boards are provided for telephony, extra I/O, and
high-resolution graphics. Compared to today's personal
computers, the Model 712's system board functions are usually
found on a personal computer's motherboard, backplane
(if any), and two to three expansion boards. This level
of integration on the Model 712 exceeds the density of personal
computer functionality, while providing current workstation
performance.

Chassis 

The chassis assembly consists of a plastic base, a metal
chassis, a metal liner for EMI containment of the rear I/O
connectors, and a plastic rear dress panel (see Fig. 4). The
dress panel includes silkscreened graphics to identify the
connectors and state necessary regulatory information,
eliminating the need for information labels. The chassis has
a variety of holes and embossments to assist in joining the
plastic parts to it. The plastic base provides outer air venting
and cosmetic appeal to the product while also containing
several snaps and guides for mating parts. The metal liner
provides EMI finger contact to all connectors in one part,
whereas previous products often required many different
clips for such functionality. Held together via plastic heat
stakes, the plastic base, the metal chassis, the metal liner,
and the plastic dress panel make up the main assembly
chamber of the product. The main circuit board, power supply
and cover, disk brackets, and top cover all snap or drop
into this chassis. Option boards are also easily installed into
the chassis on top of the main system board, with integral
bulkheads that mate vertically to chassis cutouts (also without
fasteners).

Power Supply Cover 

The power supply cover is another example of integration.
Many parts were "designed out" by this single plastic part
that performs six functions. The main function is to protect
end users from dangerous voltages by shrouding the exposed
power supply. The cover snaps into the chassis from front to
rear and is removable only by using a screwdriver to disengage
the snap that holds it in place. In addition to shrouding
the power supply, the cover secures the power supply board
in place, houses the fan and speaker, channels air flow, and
provides structural support for the monitor. The fan simply
snaps down inside the cover and seals to the sides and top
of the cover. The speaker slides down and press fits into a
simple pocket, which provides acoustic baffling. After the
cover is installed, cables from these devices are routed to
the main system board for electrical connection.

HP-PAC Disk Brackets 

The disk brackets are made of HP's newly patented HP-PAC
material.¹ This material is made of expanded polypropylene
beads, and is used most often to produce shipping carton
cushions for many types of products. Instead of placing this
material around a finished product to cushion it in a shipping
carton environment, it is instead formed to fit inside a
product with integral recesses to embed internal components.
For the Model 712 workstation, the HP-PAC material
is used to hold the hard disk and flexible disk mechanisms
in place. The HP-PAC used in the workstation consists of
three parts: a bottom shell which provides a recess for both
flexible and hard disk, and two separate top pieces for covering
each disk mechanism (see bottom portion of Fig. 4).
Because of the cushioning properties of the HP-PAC material,
the disk drive mechanisms benefit from reduced shock
and vibration levels. The HP-PAC material also provides
integral air channels for inlet air to be drawn across hot
areas of the disk drive mechanisms. The interesting feature
of HP-PAC is that no screws are needed to install the mechanisms.
The devices simply drop into recesses inside of the
cushioning material, and cables can be connected directly to
the embedded mechanisms. Once in place, the chassis enclosure
then retains the top and bottom shells of HP-PAC
around each device.

Top Cover 

The top cover includes a configurable bezel for the flexible
disk area, a plastic top shell, and a thin metal liner to complete
the EMI enclosure. The liner is held to the cover via
plastic heat stakes and has a series of fingers on each side of
the cover to contact the chassis and contain EMI radiation.
The flexible disk bezel is designed to snap into the front of
the cover, which then configures the frontal appearance of
the product. The cover assembly drops vertically onto the
chassis and then slides rearward until alignment hooks and
snaps in the cover engage to hold the cover in place.
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External Peripheral Products 

The product design of the three external peripherals also
includes a large degree of functional integration. Each of
these boxes is designed as a miniature Model 712 workstation,
with HP-PAC cushions providing location and support
for the drive mechanism, a printed circuit board (for
power conversion), power switch plunger, and cabling. The
plastic cover for each product includes any necessary doors,
light pipes, and buttons. The chassis assembly of each product
integrates a plastic base, metal chassis, spring clip, dress
panel, and SCSI signal cable (attached with screws by the
vendor). Thus, final assembly parts involved in the manufacturing
of the box include only the chassis assembly, internal
power cable, printed circuit board, plunger rod, HP-PAC,
disk mechanism, and top cover. Like the workstation, there
are no fasteners for manufacturing or the customer to deal
with, and the top cover snaps into place to retain all parts
inside.

Low Cost for Entry-Level Pricing 

To command lower material costs for mechanical components,
all custom plastic and sheet-metal parts were hard-tooled
for mass production. The chassis of each product
was designed with a minimum of folded features to reduce
part complexity and the cost associated with that complexity.
All major sheet-metal parts use progressive tooling for
the lowest price.

To reduce the amount of final assembly time (and labor
costs) involved in the product, components were designed
with a high degree of functional integration. Integrated
components (such as chassis or top cover assemblies) are
assembled by vendors, placing the burden of labor on these
non-HP processes and thus achieving lower pricing of the
final product. This functional integration of components also
lowers cost by reducing part count and related inventory
management.

Fig. 4. The Model 712 workstation showing components
disassembled from the chassis.

Because of the no-fastener design, final assembly takes
under four minutes for the workstation product and comparable
times are achieved for the external peripherals. This
ease of manufacturing lowers manufacturing costs because
of reduced assembly time and overhead costs. It also makes
the product much better suited to indirect market channels,
which prefer to configure products themselves and often do
this at the last possible moment before shipment.

Environmentally Friendly 

The Model 712 workstation and peripherals also conform to
HP's new guidelines for product stewardship. Virtually every
component of the workstation and peripheral products can
be easily disassembled, identified, and recycled. Each plastic
part contains engraved information that identifies the
type of plastic used, and only four different types of plastic
are used within the entire family of products. To assist the
disassembly process, the products use plastic heat staking
to join parts together, which can easily be cut away during
the disassembly process. The new HP-PAC material can be
recycled as well, either by grinding to pellet size and reusing
in other shipping cushion parts, or by melting the material
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down to solid plastic. And again, because there are virtually 
no fasteners to deal with, disassembly is quick and thus
more parts are given to recycling. Materials with bromide
compositions have been avoided, except for the HP-PAC
parts, which require a bromide flame-retardant treatment to
meet safety requirements.

Other product stewardship features include:

• No painted components (all plastics with molded colors)
• No plated plastics
• No adhesives
• Required labels can be recycled along with plastic base
  material
• Reusable aftermarket components (flexible and hard disk,
  power supply, CPU, and fan)
• Bulk packaging of final assembly components implemented
  on larger parts (reduces manufacturing waste)
• Printed circuit boards built in approved non-ODS (ozone-
  depleting substance) processes
• Embedded fan (low acoustic noise).
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Development of a Low-Cost, 
High-Performance, Multiuser Business 
Server System 

Using leveraged technology, an aggressive system team, and clearly 
emphasized priorities, several versions of low-end multiuser systems 
were developed in record time while dramatically improving the product's 
availability to customers. 

by Dennis A. Bowers, Gerard M. Enkerlin, and Karen L. Murillo 



The HP 9000 Series 800 Models E25, E35, E45, and E55
(Ex5) and the HP 3000 Series 908, 918, 928, and 938 (9x8)
business servers were developed as low-cost, performance-
enhanced replacements for the HP 9000 F Series and low-end
G Series and the HP 3000 Series 917, 927, 937, and 947.
The development of the PA-RISC PA 7100LC processor chip
and the LASI (LAN/SCSI) I/O interface and the evolution of
DRAMs for main memory enabled the development of these
low-end servers. The PA 7100LC and the LASI I/O interface
are described in the articles on pages 12 and 36 respectively.

The priorities for the Model Ex5 and Series 9x8 server
project were short time to market, low cost, and improved
performance. The functionality and quality of the new servers
were to be as good as the products they were replacing,
if not better. The challenge was to get these new servers to
market as soon as possible so that HP could continue to be
competitive in the business server market and our customers
could benefit from better performance at a lower price. We
were able to get the first versions of these systems completed,
released, and shipping on time with all new VLSI
components.

Low-Cost, Higher-Performance Features

The principal reason for achieving high integration and low
cost for the Model Ex5 and Series 9x8 servers was the development
of the PA 7100LC processor chip, which was being
developed at the same time as our servers. Integrating the
floating-point unit, the 1K bytes of internal instruction
cache, the external cache interface, the TLB (translation
lookaside buffer), the memory controller, and the general
system connect (GSC) I/O interface inside the PA 7100LC
processor chip allowed the Model Ex5 and Series 9x8
designers to condense the CPU and main memory onto
the same board.

Also, at the same time as our new servers were being developed,
DRAM densities doubled (in some cases quadrupled) to
allow more memory to be put into a smaller space. The
Model Ex5 and Series 9x8 servers use the same industry-
standard ECC (error correction coded) SIMM modules used
in the HP 9000 Model 712 and other HP workstations. The
Model Ex5 and Series 9x8 servers use 16M- and 32M-byte
SIMMs, which must be inserted in pairs to provide 32M to
256M bytes of main memory. ECC memory was chosen because
it carries two additional address lines, making it possible
to put four times the memory capacity on one SIMM
while staying compatible with industry-standard modules.
The 64M-byte SIMM was designed several months after first
introduction of the new low-end servers to boost their maximum
memory to 512M bytes. This larger SIMM is not available
as an industry standard.

Four versions of the Model Ex5 and Series 9x8 processor
have been developed, differentiated by clock speed, cache
size, and cost. Each version is fully contained on the system
board (which also contains cache, main memory, processor
dependent hardware and firmware, and 802.3 LAN connect)
and is easily installable and upgradable. Table I lists the
technical specifications for the different Model Ex5 systems
and summarizes the HP-UX* performance characterizations.
The Series 9x8 MPE/iX systems have equivalent CPU hardware,
and their specifications are close to those given in
Table I.



Table I

Technical Specifications for HP 9000 Model Ex5 Systems
Running the HP-UX Operating System

Processor Performance Models       E25    E35    E45    E55

Clock (MHz)                         48     64     80     96
SPECint92                           44     65     80    104
SPECfp92                            66     98    120    156
OLTP Transactions/s                 80    125    155    180
Standard memory/cache (M bytes)     16     16     16     16
Maximum memory (M bytes)           512    512    512    512
Cache size (K bytes)                64    256    256   1000
Cache SRAM speed (ns)               15     12     10    7.5






Architecture 

Fig. 1 shows a block diagram for the Model Ex5 and Series
9x8 servers.

The general system connect (GSC) bus was designed as a
new, more powerful system bus for higher performance. The
Model Ex5 and Series 9x8 servers only use the GSC bus for
the processor, main memory, and 802.3 LAN through the
LASI chip. The midrange and high-end server systems also
support the GSC bus as their high-performance I/O bus. All
PA-RISC systems support the HP-PB¹ (HP precision bus) as
the common I/O bus because multiple functionality (hardware
and drivers) currently exists for this bus. The interface
from the GSC bus to the HP-PB is accomplished in a chip
called the HP-PB bus converter.

The HP-PB bus converter chip is a performance-improved
version of the bus converter that was used in the HP 9000 F
and G Series and HP 3000 Series 9x7 machines. This chip
allows the Model Ex5 and Series 9x8 servers to leverage
HP-PB I/O functionality from the systems they are replacing.



The HP-PB bus converter implements transaction buffering†
as an HP-PB slave, gaining performance improvements of
10% to 28% over its predecessor. The chip supports GSC to
HP-PB clock ratios ranging from 3:1 to 5:1 in synchronous
mode when the GSC bus is operating under 32 MHz. It
switches to asynchronous mode when the GSC bus operates
in the 32-to-40-MHz range. These ratios and the asynchronous
feature of the HP-PB bus converter allow fair flexibility
in CPU and GSC operating frequencies while maintaining a
constant 8-MHz HP-PB frequency. The bus converter also
provides an interface to the access port used for remote
support, and the control signals used for the chassis display
and status registers. The chip is designed for the HP
CMOS26B process and comes packaged in a 208-pin MQFP
(metal quad flat pack).

The other key VLSI chip used in the I/O structure for the
Model Ex5 and Series 9x8 servers is the LASI chip. The LASI
chip is designed to have the same integration impact on core
I/O as the PA 7100LC had on the CPU and GSC bus interface.
The workstation products are able to take advantage of this
(see article on page 6), but the multiuser server systems
were not able to take advantage of LASI functionality.

† With transaction buffering, during reads from disk, data is buffered so that HP-PB
transactions can continue at maximum pace.

[Block diagram: system card with the PA 7100LC processor (64-entry TLB, memory
and I/O controller, floating-point coprocessor), unified instruction and data
cache, memory SIMM pairs, general system connect (GSC) bus, LASI I/O interface,
flash EPROM, EEPROM, HP-PB bus converter, HP-PB expansion I/O slots,
multifunction I/O card, internal disks and removable media, and power supply.]

Fig. 1. Block diagram for the HP 9000 Series 800 Models Ex5 and the HP 3000
Series 9x8 business servers.

LASI functionality includes interfaces to IEEE 802.3 LAN,
SCSI, processor dependent code, Centronics, RS-232, audio,
keyboard, flexible disk, and GSC bus arbitration logic and
the real-time clock. Because HP-UX and MPE/iX software
drivers could not be made available in time for our release,
only a small subset of LASI functionality could be used on
the new servers. Thus, the decision was made to continue
using the core I/O card from the previous versions of low-end
servers because it provides all the functionality needed.

For the 96-MHz version of the Model Ex5 and Series 9x8
servers, a chip with a subset of the functionality of LASI was
used. This was developed as a cost reduction for those applications
that use only the LAN, GSC bus arbitration, and
processor dependent code path. The 96-MHz version had to
add a real-time clock on the system board to have equivalent
functionality to what was needed from LASI.

In addition to the above VLSI chips and printed circuit
boards, the Model Ex5 and Series 9x8 servers have the internal
capacity for two disk drives (4G bytes), two removable
media devices, and up to four I/O slots. The packaging and
power supplies for the new servers are highly leveraged
from the previous low-end server systems.

Meeting Fast Time-to-Market Goals

Meeting deadlines for any program is always a challenge.
Too often it is believed that a few extra hours a week is all
that is needed to keep the project on track. But many well-
intentioned programs soon lose time with unexpected delays
even when the project team is made up of industrious
folks willing to do whatever it takes to stay on schedule.

At large corporations like HP, where releasing a product to
market may span several divisions, the task is even more
daunting. With our lab's mission of providing world-class
low-end commercial business systems and servers, time to
market is always expected to be a key objective. In the case
of the Model Ex5 and Series 9x8 program, it was the primary
objective. Additionally, we were challenged to keep cost
projections in line with the set goals, and to meet or exceed
the quality of the versions of the low-end servers that we
were replacing. Quality is consistently a key objective on all
HP products.

The main challenge for the Model Ex5 and Series 9x8
program was to achieve (on schedule) an order fulfillment
cycle time† of 10 or fewer days for the entire product family.
With the existing product family averaging order fulfillment
cycle times four to five times larger than our 10 or fewer
days goal, it was evident that for the new servers a
well-orchestrated program that involved the entire system
team was necessary to meet this challenge.

Fig. 2 shows a spider chart of the overall metrics for the
Model Ex5 and Series 9x8 program. Note that the program
achieved or exceeded all planned manufacturing release
goals. Even the factory cost goal was exceeded, which was
at risk when the hardware team added existing material into
the design over less expensive functionality, reducing the
software development schedule. The order fulfillment cycle
time objective was not only achieved but exceeded! For the
first three months of production, order fulfillment cycle time
averaged under nine days.

† Order fulfillment cycle time is measured from when HP receives a customer's order to the
time when the order is delivered at the customer's dock.

Fig. 2. Spider diagram showing how well the project team for the
Models Ex5 and Series 9x8 servers met their project goals.

The following sections summarize the reasons we met or
exceeded our cost, quality, time-to-market, and manufacturing
release goals.

Consolidation of Project Team. When the Model Ex5 and
Series 9x8 program was in its early stages of design, the
development team was dispersed in two different geographic
locations. The remote organization was eliminated and the
project development and management were consolidated in
one location under one manager. With this organization,
technical decisions regarding system requirements could be
made quickly and effectively.

Ownership of Issues. A system team composed of representatives
from the different organizations involved in the
development of the Models Ex5 and Series 9x8 servers was
organized. Weekly one-hour meetings were held with the
main focus on issues or concerns that impacted the project
schedule. Communication was expected to be limited to
discussions that affected everyone. Issues were captured
and assigned an owner with a date assigned for resolution of
the issue. Representatives at the meeting were expected to
own the issues that were presented to their organization. No
issue was closed until the team agreed upon it. This ensured
that technical problems did not "bounce" around looking for
an owner.

Interdivisional Communication. Effective interdivisional teams
establish good working relationships to ensure timely response
to actions and issues. An example was the decision
to change the core I/O functionality. While the hardware
team improved their factory cost by incorporating new, less
costly hardware, the software team would have realized a
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longer schedule to provide new software features to support 
the new hardware. After reviewing the plans, the hardware 
team, aware of the critical time-to-market objective, recom- 
mended a return to the existing I/O feature implementation at 
an impact to factory cost for the sake of the software devel- 
opment team's ability to improve their schedule. The end 
result was that the hardware team still achieved their factory 
cost goals (by making adjustments elsewhere), and the soft- 
ware development team achieved their schedule goals. 

Leverage Design Where Possible. When time to market was
established as the key objective for the project, the development
teams realized that leveraging from as many existing
products as possible would greatly benefit achieving this
goal. The following components were leveraged from new
or existing products:

• Product package. Sheet metal was leveraged from the existing
  low-end business servers with minor changes to accommodate
  new peripherals and a different processor and main
  memory partitioning scheme. Plastic changes were kept to a
  minimum in an effort to use tools already established. (Only
  one new tool was required.)
• Base system configuration. The base system was established
  using the I/O printed circuit boards and several peripherals
  available on the existing low-end servers.
• Memory. The memory design was leveraged from the memory
  configuration used in the HP 9000 Model 712 workstation,
  which uses SIMM modules for the base memory
  system. Higher-density memory was designed specifically
  for the Model Ex5 and Series 9x8 servers after first release
  to increase their maximum memory capacity.
• Power supply. The power supply was leveraged from the
  existing servers.
• Printed circuit boards. The core I/O boards from the existing
  servers were used with only minor firmware changes to
  the HP-UX version. The processor board and backplane
  were new designs based on ideas shared with the Model 712
  development team.
• VLSI. The PA 7100LC processor chip and the LASI core I/O
  chip were leveraged from the Model 712 workstation, which
  was being designed at the same time as our server systems.
• Firmware. Some of the firmware and I/O dependent code
  was codeveloped with the Model 712 development team.

Fast Time to Manufacturing Release. The use of concurrent
engineering played a key role in reducing the back-end
schedule. The back end of the schedule consists largely of
manufacturing activities (including final test and qualification)
aimed at achieving a release of the product for volume
shipment. In the case of the Model Ex5 and Series 9x8 servers,
with the individual boards being built in two geographically
different manufacturing facilities, it was imperative
that communication between these entities receive ample
attention.

To facilitate this communication, a coordination team consisting
of new product introduction engineers and new product
buyers and logistics people was located in close proximity
to the R&D development team. Everyone attended the
system team meetings, which were led by the hardware lab,
to ensure that the most current information was applied to
the overall system schedule. In addition, production build
meetings were held before, during, and after each prototype
run to discuss build results. Ensuring that all manufacturing
personnel realized that these systems were engineering
prototypes, with a high potential for problems, was a difficult
task. Most people were not used to seeing lab prototypes
being built in a production process. Since the line was
shared with currently shipping products, it was extremely
important to ensure that building the prototypes did not
impede shipping other products.

Prototype Management. Two operating system environments 
were required for the new servers, the HP-UX operating 
system release 9.04 and the MPE/iX operating system ver- 
sion 4.0. Since these environments were under development 
at the same time as our products, it was essential that hard- 
ware prototypes be delivered efficiently and be of sufficient 
quality to ensure expedient use by the software develop- 
ment groups. Thus, three key objectives were considered 
essential by the development groups. First, units had to be 
of the highest quality. Second, delivery of the units had to be 
on time. Finally, downtime because of hardware problems
had to be minimized. 

To accomplish the first goal, all prototypes were built using 
the entire production process. No prototypes were handcrafted
in the lab. This ensured that units were built with the
same quality standards as are applied to released systems. 
Additionally, each customer was assured of receiving the 
latest revision of materials released to production. Even 
new parts not covered under manufacturing release criteria 
were guaranteed to be of the same revision level. All revision
levels were tracked on each unit for the life of the project.

For the second objective, a customer priority list was gener- 
ated based on customer orders and needs. After the orders 
were submitted to the manufacturing systems, build priorities 
were set so that the most critical needs were supplied first.
From functional prototypes to production prototypes, upgrade
kits were structured and made available. In cases
where a new system was not required, customers had the 
option of moving immediately to an upgrade. Also, perfor- 
mance upgrades were designed to require a swap of the 
processor card only. 

Tracking the revision level of all hardware was essential to
achieving the third objective of minimizing downtime be- 
cause of hardware. Another key point was being able to 
react to a customer's problem quickly. We used a prerelease 
support team at another HP division to ensure timely re- 
sponse. Spare material was purchased by the support team 
and defective parts were returned to the lab for analysis. 

Using all these methods, we were able to achieve the goal of 
having all operational prototype units upgraded to manufac- 
turing release equivalence before manufacturing release. 
This guaranteed test partners use of the machines for future 
development without the "not-quite-final-product" concerns. 

We were not without our share of problems in terms of 
effectively managing the prototypes. For instance, several 
units were placed inside an environmental test chamber for 
weekend testing. During the early morning hours on a
Sunday, the temperature controller of the chamber went out
of control, ramping the temperature to beyond 70°C. The
additional heat caused the fire sprinkler system in the chamber
to turn on, flooding the chamber at a rate estimated at 10
gallons per minute. The units were standing in four feet of
water, but with the disk drives external to the chamber, the
test continued. When the chamber was finally shut down,
the water mopped up, and the results checked, it was discovered
that two of the seven units, which were on the top
rack, out of the standing water, continued to operate without
failure throughout the test. This test was affectionately
named the "bathtub test."

Time-to-Market Focus. Establishing the time to market as the 
key objective for the program was not enough to ensure its 
success. The teams involved required constant reminders to
stay focused on this objective and make trade-offs accord- 
ingly. Once the schedule was confirmed and accepted, it was 
important to acknowledge the progress. Any activities that 
appeared in danger of jeopardizing the schedule were re- 
viewed and tackled accordingly. 

However, the project team realized that in the past changes 
to system requirements had a big impact on meeting project 
schedules. Changes to system requirements to modify or 
include a feature that might improve sales or could be easily 
implemented at the cost of another metric might result in 
significant changes to the hardware or operating system 
design. In the case of the Model Ex5 and Series 9x8 servers,
the system team implemented a process that was also used
by the software development teams to control design
changes. This process is called change control, which requires
the change requester to provide a specific level of
information to determine whether a particular change is
viable. While this is not a new idea, the Model Ex5 and
Series 9x8 development team elected to make one additional
rule change. Each change request submitted would be
briefly looked at to determine how the change would affect 
the base system. In other words, we wanted to ensure that a 
change was critical enough that it needed to be added to the 
products planned for the first release. 

The hardware system team put on hold all change requests 
that were determined not to be required for the first release.
To avoid causing lots of changes to the software after first
release, some of the critical enhancements that were considered
crucial to future sales were briefly reviewed and included
in the initial software release. In some cases this
meant no changes were required after the first software 
release. However, there were some instances of patches 
required for full functionality. 

Customer Order Fulfillment Cycle Time 

For the Model Ex5 and Series 9x8 servers to stay competitive,
cost and performance were not the only items that played an
important role. During 1993, it was clear that HP had an order
fulfillment cycle time problem, which of course made our
customers unhappy and affected our competitiveness. A task
force was formed to address HP's order fulfillment cycle time
problems. We found out that results from this task force
would not arrive in time to help us with our new products.
Thus, we formed a team seven months before introduction to
ensure that the reduced order fulfillment cycle time process
for the Model Ex5 and Series 9x8 servers was in place when
the products were ready to be shipped to customers.



Our goal was to reduce the time between the receipt of a
customer purchase order for a system and the time when
the system is delivered to the customer site. We wanted to
reduce this time by 79% of what it was for our existing servers.
To accomplish this goal, the following changes were
made before product introduction: 

• The product structure was made much simpler and it 
includes fewer line items.

• Product offerings to distributors were unbundled. 

• Product numbering for distributors' orders had a single SKU
(stock keeping unit) for ease of ordering.

• The rules for our factory configuration system and field 
configuration system were mirrored. 

• Early and proactive material stocking was performed before 
introduction to ensure that plenty of material was on hand 
to meet customer demand immediately. 

• Factory acknowledgments were automated for clean
orders. 

• Intensive training was given to order processing personnel 
in the field and the factory about the Model Ex5 and Series 
9x8 servers two months before introduction. 

• Consignment, demonstration, and distributor units were 
stocked before introduction. 

• More capacity was added to the factory, and assembly 
processes were streamlined.

• All new processes were tested intensively before 
introduction. 

With these steps we were able to meet and exceed our order 
fulfillment goal. 

Conclusion 

The real success of the Model Ex5 and Series 9x8 server
program was that the goals for fast time to market and reduced
order fulfillment cycle time were achieved. These
were major accomplishments considering the events that
took place throughout the whole project, including the development
of a major VLSI component, consolidation of the
design team from different divisions and locations, communication
between different manufacturing entities, and a
stream of last-minute catastrophes such as flooding prototypes
in the environmental test ovens and several eleventh-hour
VLSI bugs that had to be fixed.


HP Distributed Smalltalk: A Tool for
Developing Distributed Applications



An easy-to-use object-oriented development environment is provided 
that facilitates the rapid development and deployment of multiuser, 
enterprise-wide distributed applications. 



by Eileen Keremitsis and Ian J. Fuller 



HP Distributed Smalltalk is an integrated set of frameworks
that provides an advanced object-oriented environment for
rapid development and deployment of multiuser, enterprise-wide
distributed applications. Introduced in early 1993,
and now in its fourth major release, HP Distributed Smalltalk
leverages the ParcPlace Smalltalk language and the
VisualWorks development environment. Together, HP Distributed
Smalltalk and VisualWorks enable rapid prototyping,
development, and deployment of CORBA-compliant
applications.†

In the global marketplace, corporate information technology
needs are increasingly demanding because worldwide competition
requires geographically dispersed operations, changing
markets require agility to remain competitive, pressure
to improve return on investment requires strong cost controls,
timely access to complete information is crucial for
business success, and finally, corporate users require access
to both legacy and newly developed information sources
and applications.

HP Distributed Smalltalk helps answer these business needs
by supporting:

• Easy on-demand access to information and services across
the enterprise

• Dynamic interaction of distributed people and resources

• Greater application flexibility and ease of use

• Insulation from differences in operating environments

• An architecture that supports an evolutionary approach
including legacy system integration

• Industry standards that will allow application interoperability
across languages, high productivity, and code reuse.

Customers can take advantage of HP Distributed Smalltalk's
easy-to-use development environment to create distributed
solutions to compete effectively in the global marketplace.
For example, with HP Distributed Smalltalk, customers might
build on the sample Forum application (described later) so
that their geographically dispersed users can simultaneously
annotate a shared document. Also, customers might use HP
Distributed Smalltalk to create three-tiered database access
applications that extend the advantages of existing client-server
architectures for better isolation between user interfaces,
data manipulation models, and legacy and new data.

† CORBA, or Common Object Request Broker Architecture, defines a mechanism that enables
objects to make and receive requests and responses. HP Distributed Smalltalk's implementation
of this architecture is described later in this article.



Three-tiered applications are the most efficient and scalable 
form of software design for building complex applications. 
They carefully separate the user interface (tier one) from 
the business rules governing the application (tier two) and
the persistent storage for the information in a database (tier
three). Each tier can reside on a different machine in a net- 
work, making best use of the network resources. HP Dis- 
tributed Smalltalk contains objects that enable the straight- 
forward construction of these applications. 
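
As a rough illustration of the three-tier separation described above, the following Python sketch keeps the user interface, the business rules, and the persistent storage in separate classes. The class names (`OrderStore`, `OrderRules`, `OrderView`) and the order-entry scenario are hypothetical and chosen only for illustration; they are not part of HP Distributed Smalltalk. An in-memory dictionary stands in for the database, and in a real deployment each tier could reside on a different machine.

```python
# Hypothetical three-tier sketch; none of these names come from HP Distributed Smalltalk.

class OrderStore:
    """Tier three: persistent storage (an in-memory dict stands in for a database)."""
    def __init__(self):
        self._rows = {}

    def save(self, order_id, quantity):
        self._rows[order_id] = quantity

    def load(self, order_id):
        return self._rows[order_id]


class OrderRules:
    """Tier two: business rules; the only tier that talks to storage."""
    def __init__(self, store):
        self._store = store

    def place_order(self, order_id, quantity):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self._store.save(order_id, quantity)

    def quantity_for(self, order_id):
        return self._store.load(order_id)


class OrderView:
    """Tier one: user interface; formats results and never touches storage."""
    def __init__(self, rules):
        self._rules = rules

    def submit(self, order_id, quantity):
        try:
            self._rules.place_order(order_id, quantity)
        except ValueError as err:
            return f"rejected: {err}"
        return f"order {order_id} accepted for {self._rules.quantity_for(order_id)} units"


view = OrderView(OrderRules(OrderStore()))
print(view.submit("A1", 3))
print(view.submit("A2", 0))
```

Because the view depends only on the rules object and the rules depend only on the store, any one tier could be replaced (for example, by a remote object) without changing the other two.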

Using HP Distributed Smalltalk 

An application written in HP Distributed Smalltalk is able to 
respond to service requests from remote systems. Remote 
entities that request services of an application do not have
to be written in HP Distributed Smalltalk as long as they are
in a system that implements the standard ORB (object request
broker) and common object services from the Object
Management Group (OMG). See "Object Management
Group" on page 86 for a description of these items.

In many cases an HP Distributed Smalltalk application's
component objects are distributed across several systems.
These distributed objects can interact seamlessly so that
end users are unaware of where the objects are located.

An overview of the process of running an HP Distributed 
Smalltalk application is shown in Fig. 1. For incoming requests
to the service provider, the ORB translates requests
from the implementation-neutral Interface Definition Language
(IDL) to the local language (ParcPlace Smalltalk) and
forwards them to the correct local object for processing. To
complete the request, the service provider's ORB takes return
values, translates them to IDL, and forwards them to the
remote ORB from which the request was received. 
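
The request path just described can be sketched in a few lines of Python. This is a toy model under stated assumptions, not HP Distributed Smalltalk's implementation: JSON text stands in for the implementation-neutral wire form, and the names `ToyOrb`, `Account`, `register`, and `handle` are hypothetical.

```python
import json

class Account:
    """A local service object; the ORB forwards decoded requests to it."""
    def __init__(self, balance):
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance


class ToyOrb:
    """Maps object names to local objects and handles incoming requests."""
    def __init__(self):
        self._objects = {}

    def register(self, name, obj):
        self._objects[name] = obj

    def handle(self, wire_request):
        # Translate the neutral wire form into a local method call.
        request = json.loads(wire_request)
        target = self._objects[request["object"]]
        method = getattr(target, request["operation"])
        result = method(*request["arguments"])
        # Translate the return value back to the neutral form for the caller's ORB.
        return json.dumps({"result": result})


orb = ToyOrb()
orb.register("savings", Account(100))
reply = orb.handle(json.dumps(
    {"object": "savings", "operation": "deposit", "arguments": [25]}))
print(reply)
```

The requester never sees the `Account` class itself, only the neutral request and reply, which is what allows the requester to be written in a different language.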

Not only does HP Distributed Smalltalk support distributed 
application delivery but it also provides an environment for 
distributed application development, which includes: 

• A complete implementation of the Object Management 
Group's latest standards

• A rich suite of tools for application development and admin- 
istration including simulated remote test support, a remote 
debugger, and an IDL interface browser and generator 

• A user interface environment and sample applications that
developers can reuse or extend, or simply use to become
familiar with the system.



© Copr. 1949-1998 Hewlett-Packard Co. 



April 1995 Hewlett-Packard Journal 85



Object Management Group 



The Object Management Group, or OMG, is a nonprofit international corporation
made up of a team of dedicated computer industry professionals from different
corporations working on the development of industry guidelines and object management
specifications to provide a common framework for distributed application
development.

OMG publishes industry guidelines for commercially available object-oriented
systems, focusing on areas of remote object network access, encapsulation of
existing applications, and object database interfaces. By encouraging industrywide
adoption of these guidelines, OMG fosters the development of software tools that
support open architecture, enabling multivendor systems to work together.

To define the framework for fulfilling its mission, in 1992 OMG published its Object 
Management Architecture Guide. This guide provides a foundation for the develop- 
ment of detailed interfaces that will connect to the elemental components of the 
architecture. Fig. 1 shows the four main components of this architecture: 

• The object request broker (ORB) enables objects to make and receive requests and
responses in a distributed object-oriented environment.

• Object services is a collection of services with object interfaces that provide basic
functions for creating and maintaining objects.

• Common facilities is a collection of classes and objects that provide general-purpose
capabilities useful in many applications.

• Application objects are specific to particular end-user applications. 



[Figure: block diagram with the labels Application Objects, Common Facilities, and Object Services.]

Fig. 1. The object management architecture.

The application objects, object services, and common facilities represent groupings
of objects that can send and receive messages. The software components in each of
these primary components have application programming interfaces that permit their
participation in any computing environment that is based on an object technology
framework.



In addition, because HP Distributed Smalltalk is an extension
of VisualWorks, developers are able to do their programming
in a language they already know (ParcPlace
Smalltalk) using the VisualWorks application builder.

VisualWorks is an implementation of the Smalltalk program- 
ming language and environment. It provides an excellent 
environment for building standalone and simple client/server 
applications that are 100% portable between many of the 
major computing platforms and operating systems. HP saw 
an opportunity to enhance the capabilities of VisualWorks to 
be the basis for next-generation applications by adding ob- 
jects that enable VisualWorks systems to communicate di- 
rectly using a standardized set of communications facilities. 

Framework 

The HP Distributed Smalltalk framework is an environment
that encompasses everything from communication with
other systems through database access to the object-oriented
ParcPlace Smalltalk language and a rich suite of developer's
tools, all seamlessly integrated to facilitate distributed
application development.

The major components of HP Distributed Smalltalk are
shown in Fig. 2 and briefly defined below:

HP Distributed Smalltalk ORB. This is a full implementation
of the Object Management Group's Common Object Request
Broker Architecture (CORBA).

Remote Procedure Call (RPC) communication. This component
supports efficient and reliable transfer of messages
between systems.

HP Distributed Smalltalk object services. This includes all
standard object services required by distributed systems, as
well as support for creating and maintaining objects and the
relationships between them.



[Figure: HP Distributed Smalltalk System; Network Connection. ORB = Object Request Broker, IDL = Interface Definition Language.]

Fig. 1. Overview of the HP Distributed Smalltalk process.

[Figure: block diagram with the labels HP Distributed Smalltalk Sample Applications; HP Distributed Smalltalk User Environment and Services; HP Distributed Smalltalk and VisualWorks Developer Tools and Services; OODBMS and RDBMS Access; Multiplatform Support; HP Distributed Smalltalk Object Services; HP Distributed Smalltalk ORB; RPC Communications.]

Fig. 2. The major components of HP Distributed Smalltalk.

Multiplatform support. HP Distributed Smalltalk applications
that run on one platform (hardware and operating
system combination) can run, without porting, on any other
supported platform.

OODBMS and RDBMS access. HP Distributed Smalltalk provides
database access directly to HP's Odapter† and Servio's
GemStone as well as to Sybase and Oracle (via VisualWorks).
HP Odapter can be used to provide access to a
variety of other database systems.

HP Distributed Smalltalk developer tools and services. This
level of the framework provides support specifically designed
for developing, testing, tuning, and delivering distributed
applications. HP Distributed Smalltalk incorporates a
rich development environment, application builder support,
and the ParcPlace Smalltalk language.

HP Distributed Smalltalk user environment and services.
These services include a reusable demonstration user interface
and desktop environment support for users' work
sessions and normal desktop activity.

HP Distributed Smalltalk sample application objects. These 
objects provide developers with example code that can be 
reused or extended, or can provide a source of ideas for 
developing alternate applications. 

The following sections provide more detailed descriptions 
of the components that make up HP Distributed Smalltalk.

HP Distributed Smalltalk Object Request Broker 

HP Distributed Smalltalk is a complete implementation of 
CORBA, the Object Management Group's specification of an
object request broker. HP Distributed Smalltalk's compliance 
provides the basis for object and application interoperability.

CORBA specifies core services that are required of an object
request broker to support interoperable distributed computing.
The CORBA specification includes the following core
services:

Interface Definition Language Compiler. OMG has defined the
Interface Definition Language, or IDL, to be independent of
other programming languages. Interfaces for objects that can
provide distributed services are written in IDL so that they
are accessible to service requesters that might be written in
Smalltalk, C, C++, or another language.

OMG recently approved the IDL-to-Smalltalk language binding
proposed by HP and IBM. This is important because it
allows users to build distributed systems using multiple languages
where appropriate, allowing a Smalltalk object to be
able to request services of a C++ object or vice versa.

Interface Repository. This service provides a registry of distributable
object interfaces for a given system. Any object that
remote objects can access has an interface in the interface
repository. For example, when objects on two or more systems
at different locations collaborate in an application, they
interact by sending messages to their interfaces. Since external
clients have access to an object's services only
through the object's interface, the implementation of the
object is private. This privacy provides a variety of benefits,
including security, language independence, and freedom to
modify the implementation of how a service is performed
without external repercussions.

HP Distributed Smalltalk ORB Support. The object request
broker (ORB) is the key to providing support for distributed
objects. By providing an ORB on each system, HP Distributed
Smalltalk makes the location of any object transparent
to clients requesting services from the object.

When a message is sent to a local object, the activity is handled
normally. When a message is sent to a remote object,
the remote object's local surrogate (created automatically
by the ORB) intercepts the message, then uses the ORB to
locate the remote object and communicate with it (see Fig.
3). Results returned to the calling object appear exactly the
same, whether the message went to a local or remote object.

An ORB's responsibilities include:

• Marshalling and unmarshalling messages (translating objects
to and from byte streams for network transmission)

• Locating objects in other images or systems

• Routing messages between surrogates and the objects they
represent.

While a request is active, both client and server ORBs exchange
packet information to track the course of the request
and resolve any network or transmission errors that might
occur.

† HP Odapter is a complementary product from Hewlett-Packard that provides an efficient
and scalable link between objects implemented in an object-oriented language such as
Smalltalk or C++ and the entities in an Oracle relational database.

[Figure: Machine A, Machine B, and Machine C, with the labels Object and Apparent Connection.]

Fig. 3. HP Distributed Smalltalk handles remote access so that a
request to a remote object appears the same as a request to a local
object.
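
The surrogate and marshalling behavior described in the ORB support discussion can be illustrated with a small Python sketch. It is only a model under stated assumptions: JSON-encoded bytes stand in for the marshalled form, a direct function call stands in for the network, and the names `Surrogate`, `RemoteSide`, and `Counter` are hypothetical rather than taken from HP Distributed Smalltalk.

```python
import json

class Counter:
    """The real object, imagined to live on a remote machine."""
    def __init__(self):
        self.count = 0

    def add(self, n):
        self.count += n
        return self.count


class RemoteSide:
    """Stands in for the remote ORB: unmarshals bytes and calls the real object."""
    def __init__(self, target):
        self._target = target

    def receive(self, payload):
        message = json.loads(payload.decode("utf-8"))
        result = getattr(self._target, message["selector"])(*message["args"])
        return json.dumps(result).encode("utf-8")


class Surrogate:
    """Local stand-in that intercepts messages and forwards them as bytes."""
    def __init__(self, transport):
        self._transport = transport

    def __getattr__(self, selector):
        # Called only for names the surrogate does not define itself,
        # so every application-level message is intercepted here.
        def forward(*args):
            payload = json.dumps(
                {"selector": selector, "args": list(args)}).encode("utf-8")
            return json.loads(self._transport(payload).decode("utf-8"))
        return forward


local = Counter()
remote = Surrogate(RemoteSide(Counter()).receive)

# The caller sends the same message either way.
print(local.add(2), remote.add(2))
```

The calling code cannot tell from the call syntax or the result which of the two objects is remote, which is the transparency that Fig. 3 depicts.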



Object Services and Policies 

Object services extend the core ORB services to support 
more advanced object interaction. HP Distributed Smalltalk
implements OMG's Common Object Services Specification 
(COSS), which extends CORBA to provide protocols for com- 
mon operations like creating objects, exporting and destroy- 
ing objects (life cycle), locating objects (naming), and asyn- 
chronous event notification. Additional object services and 
policies provide efficient interaction between finer-grained 
distributable objects. 

Naming.† There is a standard for assigning each object a
unique user-visible name. Names are used to identify and 
locate both local and remote objects. 

Event Notification.† This is a service that allows objects to
notify each other of an interesting occurrence using an 
agreed protocol and set of objects. 
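
A minimal sketch of how naming and event notification fit together is shown below. The `NameService` and `EventChannel` classes and their method names are hypothetical illustrations, not the COSS interfaces themselves; the sketch assumes that consumers are plain callables and that delivery is synchronous.

```python
class NameService:
    """Binds unique user-visible names to objects (hypothetical API)."""
    def __init__(self):
        self._bindings = {}

    def bind(self, name, obj):
        if name in self._bindings:
            raise KeyError(f"name already bound: {name}")
        self._bindings[name] = obj

    def resolve(self, name):
        return self._bindings[name]


class EventChannel:
    """Delivers each pushed event to every connected consumer."""
    def __init__(self):
        self._consumers = []

    def connect(self, consumer):
        self._consumers.append(consumer)

    def push(self, event):
        for consumer in self._consumers:
            consumer(event)


names = NameService()
names.bind("orders/events", EventChannel())

received = []
channel = names.resolve("orders/events")
channel.connect(received.append)
channel.connect(lambda event: received.append(event.upper()))
channel.push("order shipped")
print(received)
```

The supplier and the consumers share only the channel's name and the agreed event format, so neither side needs a direct reference to the other.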

Basic† and Compound Life Cycle. There are standard ways for
objects to implement activities such as create and initialize, 
delete, copy, and move both simple and compound objects, 
externalize (prepare for transmission to remote systems), 
and internalize (accept objects transmitted from remote 
systems). Compound objects, built from simple objects, can 
include application components, anything that appears on a 
user's desktop (such as a document, a mail handler, or a
graphics toolbox), complete applications, and so on. 

Relationships: Containment and Links. Links allow networked 
relationships among objects. Objects can be linked together 
with various levels of referential integrity (determining how 
to handle situations when one of the parties to the link is 
deleted), and in one-to-one, one-to-many, and many-to-many
relationships. 

Together with links, containment establishes and maintains 
relationships between objects. Each object has a specific 
location within some container. Containers are related hier- 
archically. HP Distributed Smalltalk provides objects that 
implement a generic distributed container. Programmers 
can use these objects to build specific implementations such 
as an electronic mail envelope (containing components of a
message) or a bill of sale (containing information about
items in a shipment) with minimal extra programming.
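
The following Python sketch models the two kinds of relationship described above: hierarchical containment and links with a referential integrity policy. The class names and the two policies shown ("nullify" and "restrict") are assumptions made for illustration; the article does not name the integrity levels that HP Distributed Smalltalk provides, and only the link bookkeeping is modeled here.

```python
class Container:
    """A node in a containment hierarchy; each item has exactly one container."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.items = []
        if parent is not None:
            parent.items.append(self)

    def path(self):
        # Walk up the hierarchy to build a location such as "desk/envelope/body".
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


class LinkTable:
    """Records links between objects with a per-link deletion policy."""
    def __init__(self):
        self._links = []  # entries are (source, target, policy)

    def link(self, source, target, policy="nullify"):
        self._links.append((source, target, policy))

    def targets_of(self, source):
        return [t for s, t, _ in self._links if s is source]

    def delete(self, obj):
        # "restrict" refuses deletion while the object is still linked.
        if any(p == "restrict" for s, t, p in self._links if obj is s or obj is t):
            raise ValueError(f"{obj.name} is still linked")
        # "nullify" simply drops links that mention the deleted object.
        self._links = [entry for entry in self._links
                       if obj is not entry[0] and obj is not entry[1]]


desk = Container("desk")
envelope = Container("envelope", desk)
body = Container("body", envelope)
attachment = Container("attachment", envelope)

links = LinkTable()
links.link(body, attachment)  # one-to-one link with the default policy
print(body.path())
print([t.name for t in links.targets_of(body)])
links.delete(attachment)
print(links.targets_of(body))
```

One-to-many and many-to-many relationships fall out of the same table by adding more entries for the same source or target.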

Properties and Property Management. Properties are part of an 
object's external interface (owner, creation date, modifica- 
tion date, version, access control list, and so on). They are a 
dynamic version of attributes.

Application Objects and their Assistants. Application objects 
are relatively large-grained compound objects that end users 
deal with (e.g., a file folder or an order entry form). Application
assistants are lightweight objects that implement most of
the policies and participate in most of the services that desktop
objects need to participate in. Application assistants function
as the developer's ambassador into the object services.
Application assistants can be stored and activated efficiently 
and provide the basis for future transaction support. 

† This service is specified in COSS 1.0.



[Figure: Chart Object (Semantic) and Presentation Objects.]

Fig. 4. The bulk of user interaction is with local presentation objects,
minimizing and condensing the need to propagate semantically
relevant changes over the network. Here, for example, a user might
choose to look at a chart (semantic object) as a pie, line, or bar chart
presentation object.

Presentation/Semantic Split. A logical split between distributed
objects, the presentation/semantic split provides an
efficient architecture for distributed applications. Local pre- 
sentation objects handle the bulk of user interaction, while a 
semantic object (which can be anywhere on the network) 
holds a shared persistent state of the object (see Fig. 4). 

By using the presentation/semantic split, the designer can 
choose what part of the application should be shared and 
what should be unique to each user. Applications that might 
use the presentation/semantic split include a team whiteboard
where all behavior is shared but each user can write
comments, or a common document with pages that are 
unique to each user so that all users can read at their own 
pace. A variety of sample applications included with HP 
Distributed Smalltalk provide illustrations of how to use the 
presentation/semantic split. 

While use of the presentation/semantic split is optional, it 
facilitates and optimizes distributed application develop- 
ment and execution. Advantages of using the presentation/ 
semantic split include: 

• Acceptable performance levels even over wide area 
networks 

• Association of a single semantic object with multiple pre- 
sentation objects, a critical feature in distributed computing 
environments where it is common for many users to work 
with the same application 

• Application access independent of local windowing systems 

• Better code reusability. 

The HP Software Solution Broker described on page 93 is a
good example of using the presentation/semantic split in an 
application. 
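
The presentation/semantic split can be sketched with one shared semantic object and several local presentation objects, following the chart example of Fig. 4. The classes below (`ChartSemantic`, `BarPresentation`, `PiePresentation`) are hypothetical and run in a single process; in a distributed setting the semantic object could live on another machine and the `update` calls would travel through the ORB.

```python
class ChartSemantic:
    """Shared state; imagined to live anywhere on the network."""
    def __init__(self, values):
        self._values = dict(values)
        self._presenters = []

    def attach(self, presenter):
        self._presenters.append(presenter)
        presenter.update(dict(self._values))

    def set_value(self, label, value):
        # Only semantically relevant changes are propagated to presenters.
        self._values[label] = value
        for presenter in self._presenters:
            presenter.update(dict(self._values))


class BarPresentation:
    """Local view: renders the shared data as text bars."""
    def __init__(self):
        self.lines = []

    def update(self, values):
        self.lines = [f"{label}: {'#' * count}" for label, count in values.items()]


class PiePresentation:
    """Local view: renders the same data as percentage shares."""
    def __init__(self):
        self.lines = []

    def update(self, values):
        total = sum(values.values())
        self.lines = [f"{label}: {100 * count // total}%"
                      for label, count in values.items()]


chart = ChartSemantic({"east": 3, "west": 1})
bars, pie = BarPresentation(), PiePresentation()
chart.attach(bars)
chart.attach(pie)
chart.set_value("west", 3)
print(bars.lines)
print(pie.lines)
```

Purely local choices, such as which presentation a user prefers, never reach the semantic object, which is why this arrangement keeps network traffic low.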

Developer Services 

HP Distributed Smalltalk also extends VisualWorks with
services that support development and test of distributed
applications.

Control Panel. The technical user interface to HP Distributed
Smalltalk for administrators and developers is invaluable for
testing and maintenance (Fig. 5). The control panel provides:

• Controls to start and stop the system cleanly

• Support for local RPC testing (simulated distribution)

• Tracing facilities to log network conversations between
objects

• Performance monitoring.

Fig. 5. The control panel provides an easy-to-use interface to
administrative and developer services.

Interface Repository Browser and Editor. The interface reposi- 
tory browser provides an iconic view of the contents of the 
interface repository where publicly available interfaces are 
specified (see Fig. 6). It is organized hierarchically so that 
developers can explore and edit interfaces and construct 
requests to use the interfaces. 

Shared Interface Repository. In HP Distributed Smalltalk, 
users can share an interface repository on a remote system 
so they do not have the overhead of keeping a copy of all of 
the interfaces on every system. The product also supports 
version management of interfaces, which is very important 
in large-scale, evolving distributed systems. 

Remote Context Inspector and Debugger. This service is an extension
that allows debugging on remote images when appropriate.
It supports object inspection and debugging for
the entire distributed execution context, including communication
between images. Fig. 7 shows using the debugger to
step through code and inspect objects that might be located
anywhere in a distributed environment.

    // Forum
    // This module defines the types and operations on forum objects
    //
    module Forum {
    #pragma IDENTITY = 5a79e5b6-b08a-0000-020f-1c6813000000
        // This interface defines the operations on the forum presenter
        //
        interface ForumPres : ContainerPres, TransparencyPres {};
    #pragma IDENTITY = 5a79e6u3-06cb-0000-020f-1c6813000000
        //
        // This interface defines the operations on forum semantic
        // objects
        //
        interface ForumSem : ContainerSem, TransparencySem {
    #pragma SELECTOR = presenterAdd user name
            // Activate the presentation object. This will result in an
            // update sequence callback to the requesting PO before it
            // returns
            void presentationAdd (
                in ForumPres presentation,
                in UserContext user,
                in string username);
    #pragma SELECTOR = presenterDelete user name

Fig. 6. The interface repository browser can be used to view or
edit interfaces that remote clients can use to call local objects.

[Figure: remote debugger screens, with the following annotations.]

• Open a debugger where you can trace the full stack on all
involved machines.

• Step through the code.

• Inspect objects in the debugger or open inspectors on any of the
objects, regardless of the system they are running on.

Fig. 7. Screens associated with a remote debugger.

Stripping Tool. To prepare an application for delivery, 
developers use the HP Distributed Smalltalk stripping tool to 
remove unneeded classes and interfaces and seal source code 
when application development is complete. The stripping 
tool's user interface suggests likely items for removal (see 
Fig. 8). 

User Services 

User services allow developers to build a desktop or office 
environment and control activities during a session. 

System Objects. HP Distributed Smalltalk supports a variety 
of system objects: user, session, clipboard, wastebasket, and 
orphanage. 

Fig. 8. Interface for the HP Distributed Smalltalk stripping tool. 



• User object. This object contains information held about 
end users of the system, including who they are, how to 
contact them, and so on. User objects may be included by 
reference in other objects. For example, a user might include a 
business card in a memo that would enable the receiver to 
get in touch with the sender. 

• Session object. All the information required about the state 
of a user's environment, including user login, preferences, 
layout, and so on, is contained in a session object. The 
session object also supports the notion of workspaces, with the 
potential for developing richer workspace environments. It 
has no icon on the desktop but it interacts with and 
supports other application objects. 

• Clipboard. This is a container for objects that are being cut, 
moved, or copied from one location to another. 

• Wastebasket. This container receives objects that users 
throw away. The wastebasket can be cleared when it gets 
too full. 

• Orphanage. This is a container for holding objects that are 
no longer needed. 

Security. Developers can use or extend HP Distributed 
Smalltalk's access control services in the applications they 
build, setting controls for host systems, users, or both. 
Host-system access control lets developers determine whether an 
image can receive messages from another system. User-level 
access control lets a developer determine whether a given 
user has any one of several kinds of privileges (e.g., read or 
write privilege) for a given object. 

Developers can administer access control programmatically 
or from the default user interface. 
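
The two levels of access control described above can be pictured with a small sketch. It is written in Python rather than Smalltalk, and the class AccessPolicy, its method names, and the host, user, and object names are all assumptions made for illustration; none of them is the HP Distributed Smalltalk API.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Hypothetical two-level policy: host-system check, then user-level check."""
    allowed_hosts: set = field(default_factory=set)
    # Maps (user, object_id) to the set of privileges granted, e.g. {"read"}.
    grants: dict = field(default_factory=dict)

    def allow_host(self, host):
        self.allowed_hosts.add(host)

    def grant(self, user, object_id, privilege):
        self.grants.setdefault((user, object_id), set()).add(privilege)

    def may_receive_from(self, host):
        # Host-system control: may this image accept messages from that system?
        return host in self.allowed_hosts

    def has_privilege(self, user, object_id, privilege):
        # User-level control: does this user hold this privilege on this object?
        return privilege in self.grants.get((user, object_id), set())

    def check(self, host, user, object_id, privilege):
        # A request passes only if both levels allow it.
        return (self.may_receive_from(host)
                and self.has_privilege(user, object_id, privilege))


policy = AccessPolicy()
policy.allow_host("hostA")
policy.grant("pat", "memo-1", "read")
print(policy.check("hostA", "pat", "memo-1", "read"))
print(policy.check("hostA", "pat", "memo-1", "write"))
print(policy.check("hostB", "pat", "memo-1", "read"))
```

Only the first request passes both checks; the second fails at the user level and the third at the host level.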

Example Code 

While all HP Distributed Smalltalk code is available to read, 
reuse, or extend, the default user interface and certain sam- 
ple applications may be the best place to start. 
Fig. 9. The screen for presenting the office metaphor and some 
typical objects in an office. 



User Interface. HP Distributed Smalltalk uses and provides 
support for a user interface based on an office metaphor, 
which is designed for easy use and understanding. In the 
default user interface, all the objects a user works with 
locally (folders, file cabinets, documents, and so on) are 
contained in an office. All offices on the same system are in the 
same building. Users can navigate between buildings to 
access objects in other offices. Fig. 9 shows a typical office 
and some of the objects available in an office. 

Sample Applications. Sample applications illustrate the use of 
distributed objects. For example, the Forum (Fig. 10) 
provides a shared window in which several users can view and 
annotate a picture or document. The Notebook is a place to 
store both local and remote objects on a desktop. 

Users can also build their own objects from any of the 
simple objects available, including a table, chart, input field, 
picture, and text window (see Fig. 11). The sample 
applications can be extended and customized to create a variety of 
simple distributed applications. 

Creating Applications 

HP Distributed Smalltalk allows VisualWorks programmers 
to create distributed applications quickly and easily. Building 
on the benefits of Smalltalk and VisualWorks, HP Distributed 
Smalltalk users can build CORBA-compliant applications 
either from scratch or by modifying existing applications. 
As with any Smalltalk application, the distributed development 
process is iterative and designed for dynamic refinement. 

Development. Distributed application development is a 
four-step process. 

1. Design and test the application objects locally. 

2. Define the object interfaces and register them in the 
interface repository. 

3. Use HP Distributed Smalltalk's simulated remote testing 
tools (which actually use the ORB to marshal and unmarshal 
object requests) to verify the interfaces specified in the 
interface repository. 

4. Track messages and tune performance. 

Distribution. Once an application is developed, tested, and 
tuned locally, it is easy to set it up for distributed use. 

5. Copy the application classes to the Smalltalk images they 
will run on. 

6. Update the interface repositories in these images. 

The application can then run in the fully distributed 
environment without further change. Except for actual packet 
transfer, the distributed application is identical to the simulated 
remote application developed, tuned, and tested during 
development. 
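
The idea behind simulated remote testing, in which a call to a local object still travels through marshaling and unmarshaling, can be sketched as follows. This is an illustration in Python under stated assumptions, not the product's ORB: JSON stands in for the ORB's wire format, and the names Notebook and SimulatedRemoteProxy are hypothetical.

```python
import json


class Notebook:
    """Stand-in application object; the name and method are illustrative."""
    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)
        return len(self.items)


class SimulatedRemoteProxy:
    """Routes each call through marshal/unmarshal although the target is local."""
    def __init__(self, target, interface):
        self._target = target
        self._interface = interface  # operation names registered for the object

    def invoke(self, operation, *args):
        if operation not in self._interface:
            raise AttributeError(f"{operation} is not in the registered interface")
        # Marshal the request as the client side would.
        request = json.dumps({"op": operation, "args": list(args)}).encode("utf-8")
        # Unmarshal it as the "remote" side would, then dispatch locally.
        decoded = json.loads(request.decode("utf-8"))
        result = getattr(self._target, decoded["op"])(*decoded["args"])
        # Marshal and unmarshal the reply as well.
        reply = json.dumps({"result": result}).encode("utf-8")
        return json.loads(reply.decode("utf-8"))["result"]


proxy = SimulatedRemoteProxy(Notebook(), interface={"add"})
print(proxy.invoke("add", "memo"))
```

Because arguments and results must survive the round trip, a mismatch between an object and its registered interface shows up during local testing, before any packets cross the network.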
Delivery. Once the application is tested, developers can 
deliver it to their users by stripping the environment of 
unneeded objects and tools. Once stripped, the application 
looks exactly the same as applications developed in other 
languages and can be executed on any supported platform, 
including HP-UX,* SunOS/Solaris, IBM AIX, Microsoft 
Windows, Microsoft Windows NT, or IBM OS/2. Support for 
these platforms is available under a run-time license from 
Hewlett-Packard. 
A Software Solution Broker for 
Technical Consultants 



A distributed client-server system gives HP's worldwide technical 
consultants easy access to the latest HP and non-HP software products 
and tools for customer demonstrations and prototyping. 

by Manny Yousefi, Adel Ghoneimy, and Wulf Rehder 



On a typical working day an HP consultant, one of thousands 
worldwide, sits down with a customer to solve a business 
problem. The challenge, the customer may tell the consultant, 
is to move sales data to headquarters more quickly so that 
management can make timely strategic decisions. For a 
solution, the consultant might propose a decision support 
system that integrates the customer's older legacy system, 
where the sales data has traditionally been stored, with a 
faster "warehouse" database and easy access tools that 
present the information in just the form needed, right on the 
customer's desktop. "Let me show you what I mean," the 
consultant says, turning on a laptop computer (which had 
previously been connected to a LAN or telephone socket). 
Navigating through the windows on the screen, the consul- 
tant invites the customer to look through a virtual shelf 
filled with databases and access tools, all represented by 
icons, together with middleware and application development 
toolkits (see page 98). The consultant clicks on an 
icon and the tool becomes immediately available for browsing 
or for self-paced learning. From here the consultant may 
show one of the demos that are included, or navigate the 
customer through a hypertext document to more information, 
alternate products, additional options, and prefabricated 
software building blocks. No wonder that this virtual soft- 
ware laboratory is called by HP consultants, "the software 
sandbox." This consultant is actually building — from the 
tool and product portfolio in front of them — a prototypical 
decision support system for this customer. How much of 
this is fantasy and how much reality? 

The answer is that it is all reality now. The software sandbox 
that the consultant was starting to "play in" is called the HP 
Software Solution Broker (or Broker, for short) and is 
available now to HP consultants. Defining and creating a 
decision support system is, of course, not play but serious work. 
However, the ease and immediacy of the Broker, the ample 
choices, and many helpful hints make even urgent business 
problem solving an experimental sport. Best of all, the 
consultant receives these products and tools, together with 
support and on-line documentation, free of charge. For this 
convenience, substantial research efforts had to be poured into 
building such a virtual software depot, using HP's own 
hardware platform and the most advanced object technology. 
Before explaining this implementation more systematically, it 
is useful to watch our technical consultant and the customer 
at work. 



Using the Software Solution Broker 

To get a feeling for how the Software Solution Broker is used 
we will briefly watch the technical consultant show the cus- 
tomer how to build a prototypical decision support system. 

After clicking on the icon in the ORB control panel, which 
starts the object request broker (an action that in effect 
opens the lid covering the sandbox), the consultant activates 
the Software Solution Broker icon. Another window opens 
offering the Broker's classification of products, either by 
vendor, by technology, or by product name. (Alternative paths 
into the Software Solution Broker, such as a classification 
by business problem, are under development.) Choosing the 
information request button for technology, the consultant 
asks whether the customer wants to see database information 
first or options for the user interface. As an executive, 
the customer is eager to see or build a nice GUI. Clicking 
on the graphic user interface button brings up several 
choices, of which three are shown in Fig. 1. Having heard 
about VisualWorks, the customer selects it and is presented 
with the VisualWorks Showcase. 
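
The three entry paths into the Broker (by vendor, by technology, or by product name) amount to indexing one product list by different keys. The Python sketch below illustrates the idea. The product names come from the Broker's product list, but the record layout, the function classify, and the technology assigned to each product are assumptions made for this example.

```python
from collections import defaultdict

# Illustrative records; the "technology" values are assumed for this example.
PRODUCTS = [
    {"name": "VisualWorks 2.0", "vendor": "ParcPlace Systems",
     "technology": "user interface"},
    {"name": "XVT-Design", "vendor": "XVT Software",
     "technology": "user interface"},
    {"name": "SQL Monitor", "vendor": "Sybase", "technology": "database"},
    {"name": "Informix Online ISQL", "vendor": "Informix",
     "technology": "database"},
]


def classify(products, key):
    """Group product names under each distinct value of the chosen key."""
    index = defaultdict(list)
    for product in products:
        index[product[key]].append(product["name"])
    return {value: sorted(names) for value, names in sorted(index.items())}


by_technology = classify(PRODUCTS, "technology")
by_vendor = classify(PRODUCTS, "vendor")
print(by_technology["user interface"])
print(by_vendor["Sybase"])
```

A further path, such as the classification by business problem mentioned above, would need only one more key on each record.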

The consultant then shows a VisualWorks demonstration to 
explore with the customer what kind of data display 
windows, control buttons, analysis tools, and other features 
would be appropriate. After jotting down these initial 
requirements the consultant is ready to build a first prototype. 
The help button launches a palette of GUI building tools, 
and it takes only minutes to draw an example of a 
transaction entry tool for the transactions underlying the decision 
support system the customer wants built (see Fig. 2). Here 
the customer interrupts and requests that the data be shown 
in spreadsheet form as well as graphically. They agree on 
bar chart and pie chart presentations for a first cut and 
proceed to discuss the requirements for the underlying 
database. The Software Solution Broker has a "virtual shelf" of 
relational databases that work with VisualWorks, and among 
these the customer may have a favorite system, or an 
already installed legacy database. They again discuss the pros 
and cons while viewing various product demonstrations. 

We meet the customer and the consultant again after an- 
other hour or so. By then the VisualWorks front-end tool 
displays some real data pulled from a database (Fig. 3). At 
this point we leave the executive's office and describe how 
the Software Solution Broker is constructed. 
Fig. 1. Software Solution Broker user interface. 



Constructing the Software Solution Broker 

Two considerations determined the architecture and 
consequently the implementation of the Software Solution Broker. 
First, since the products on the Broker have to be accessible 
worldwide but will be updated and maintained locally, the 
global partitioning between distributed users and a central 
server functionality called for a client/server 
implementation on a wide area network (WAN). Secondly, the need to 
accommodate many different types of clients and to be able 
to encapsulate many different products in the software 
server strongly suggested vendor independence (openness) 
and adherence to certain industry standards such as the 
Common Object Request Broker Architecture (CORBA). 

Software Substrate 

Here we will not focus on the WAN implementation but 
instead will concentrate on the software substrate on which 
the Software Solution Broker is built. In the software sub- 
strate (see Fig. 4) we include the entire software kit com- 
posed of server and client development tools, tools for 
building the client/server interaction components of the 
system, and repository tools. Repository tools are essential for 
the construction of a depot that contains the information in 
the system, including the logic for accessing this information. 
After a careful technical analysis of five alternative complete 
substrate kits, VisualWorks from ParcPlace Systems was 
chosen as the development software for the PC, UNIX* client, 
and UNIX server, while HP's Distributed Smalltalk (see 
article, page 85), which also works with VisualWorks, was the 
tool of choice to build and manage the client/server 
interaction. All system information (e.g., documentation) at this 
time of writing (release 2.0) still resides with the products 
and a central repository has not yet been chosen. Tools such 
as Object Lens (working with VisualWorks) or HP Odapter 
make relational databases look like object databases, so we 
know that the selection of a repository can be made very 
quickly when needed. 

VisualWorks was the easy winner because it provides a 
complete environment for the development of true graphic 
applications that run unchanged on UNIX-system-based, PC, and 
Macintosh computers under their native windowing systems. 
Three of VisualWorks' features made it especially appropriate 
for the Software Solution Broker: 

• VisualWorks is built on Smalltalk, a pure object-oriented 
language designed for fast modular design. 

• VisualWorks possesses a tested set of development tools, 
including browsers for object classes, a thread-safe 
debugger, and a change manager to track modifications to the 
code, as well as an inspector for use in testing. 
• VisualWorks has a large class library of more than 350 types 
of portable objects. These include a rich user interface 
development toolkit suitable for all major windowing systems. 

Fig. 2. A window within the Software Solution Broker showing VisualWorks tools for prototyping a customer application. 

HP Distributed Smalltalk extends VisualWorks' capability 
for developing standalone systems into an environment for 
creating distributed object systems (see Fig. 5) by adding 
the following: 

• A full implementation of the Object Management Group 
(OMG) Common Object Request Broker Architecture 
(CORBA) core services 

• Common Object Services for life cycle operations such as 
creating objects and the relationships between them 

• Sample application objects, for example for the modular 
partitioning of client/server functionality into semantic and 
presentation objects. 

These objects and services for building distributed 
applications are portable to all platforms supported by VisualWorks. 
Furthermore, they are compatible with the OMG CORBA 
standards. HP's Distributed Smalltalk provides seamless 
support of client/server interactions between VisualWorks 
images. CORBA compliance makes our Software Solution 
Broker implementation open and capable of interoperability, 
for instance with C++ CORBA-compliant applications, and 
as soon as HP Distributed Smalltalk is OSF DCE-compliant,* 
also with DCE remote procedure calls (RPCs). For the 
current release, TCP/IP or HP Sockets are being used. 

* OSF DCE is the Open Software Foundation's Distributed Computing Environment. 

Product Encapsulation 

Everybody who has worked with spreadsheets, word pro- 
cessors, or CAD systems knows that similar or identical 
functionality does not mean that the user interfaces and 
more generally the visual, iconic, and mental models are 
comparable. For the Software Solution Broker, too, each 
product has its own artifacts and idiosyncrasies, its own 
look and personality by which we can identify it when we 
see it in use or on the shelf of a vendor. This unavoidable 
fact poses challenges for the "virtual shelf" of the Broker. 
Without wanting to blot out the individuality of a vendor's 
offering, it was the objective of the development team to 
minimize the effort needed for the user to get accustomed to 
this diversity. Generally speaking, the variety has to be hidden 
behind a simple and consistent, product independent mode 
of access with uniform and intuitive graphical symbolism. A 
particular example is the double click used consistently to 
launch an application. 
Fig. 3. Prototype display for a customer application constructed using the Software Solution Broker to select user interface and database 
software. 



Encapsulation, in the context of the Software Solution Bro- 
ker, describes a body of activities and software mechanisms 
that have two purposes: to integrate each product within the 
overall product portfolio so that the consultant can use it in 
its native mode, and to provide a uniform way to access the 
products, their associated tools, and other artifacts. This 
accessibility, it should be noted, is restricted to the features 
and artifacts that are relevant to consulting work with the 



customer. This means the consultant can access editors, 
executable code, and documentation, but isn't able to change 
the internal product configuration, the way it is stored and 
administered in folders, or the source code. Because of the 
intrinsic symmetry between Software Solution Broker 
servers and clients (see Fig. 4) the encapsulation can be done 
either on the server side or on the client side, provided the 
classes FolderPlusPO and EncapsulationDialog are present in the 
client. These two classes will be discussed below. 

Fig. 4. Software Solution Broker software substrate, showing the 
client/server architecture, the user interface engine 
(VisualWorks), and the client/server framework (HP Distributed 
Smalltalk). 

It is the already mentioned semantic/presentation split, to- 
gether with object-oriented features such as inheritance and 
polymorphism that make the encapsulation effortless. The 
semantic/presentation object distribution model is HP Dis- 
tributed Smalltalk's implementation of a distributed client/ 
server architecture. In this model, classes always appear in 
logical pairs, one representing the server semantics, the 
other their presentation in the client. Consequently, the class 
instances or objects also come in pairs. Take for instance 
the window object. Every window is composed of two log- 
ical parts: its shared (semantic) properties such as its rect- 
angular shape, and its local and personal (presentation) at- 
tributes such as color. In general, a semantic object often 
has (and controls) many different presentation objects, 
which in the case of the Software Solution Broker handle 
the remote user interactions, thus reducing network traffic. 
For instance, one semantic data display object creates and 
controls different presentations of the data as a bar chart 
and a pie chart in a decision support system. HP Distributed 
Smalltalk allows various modes of collaboration between 
the semantic and presentation objects, including messages 
that are handled by the object request broker. (For a simple 
but complete example see the HP Distributed Smalltalk 
User's Guide, chapter 10.) 
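
The bar chart and pie chart case can be sketched in a few lines. The sketch is in Python, not Smalltalk, and leaves out the object request broker entirely: the semantic object calls its presentation objects directly. The class names and the text "rendering" are assumptions made for illustration only.

```python
class DataDisplaySemantic:
    """Shared (semantic) state that controls any number of presentations."""
    def __init__(self, values):
        self._values = dict(values)
        self._presentations = []

    def attach(self, presentation):
        self._presentations.append(presentation)
        presentation.update(dict(self._values))

    def set_value(self, key, value):
        # One change to the shared data is pushed to every presentation.
        self._values[key] = value
        for presentation in self._presentations:
            presentation.update(dict(self._values))


class BarChartPresentation:
    """Local (presentation) view: one row of '#' marks per entry."""
    def __init__(self):
        self.rendered = ""

    def update(self, values):
        self.rendered = "\n".join(
            f"{key} {'#' * count}" for key, count in sorted(values.items()))


class PieChartPresentation:
    """Local (presentation) view: each entry as a whole-number percentage."""
    def __init__(self):
        self.rendered = ""

    def update(self, values):
        total = sum(values.values()) or 1
        self.rendered = ", ".join(
            f"{key} {100 * count // total}%"
            for key, count in sorted(values.items()))


sales = DataDisplaySemantic({"east": 3, "west": 1})
bar, pie = BarChartPresentation(), PieChartPresentation()
sales.attach(bar)
sales.attach(pie)
sales.set_value("west", 3)
print(bar.rendered)
print(pie.rendered)
```

How the data looks is decided entirely in the presentation classes, so only the shared values need to pass between the semantic object and its presentations.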

After this abstract introduction of HP Distributed Smalltalk's 
semantic/presentation split architecture we will describe in 
more concrete terms how it works for the encapsulation 
procedure. As stated above, encapsulation must achieve two 
goals: it has to present a graphical representation of the arti- 
fact (product, tools, demos, documentation) in its native 
mode to the remote client, and it must allow the remote user 
to launch the artifact at the server side through this 
representation. HP Distributed Smalltalk has a pair of classes, 
MediaSO and MediaPO, that accomplish exactly this. (The 
suffixes SO and PO imply that semantic and presentation 
objects, respectively, are spawned by these classes.) Tracing 
the interaction diagram between two objects of these 
classes we found that there exists a ready-made method 
called updatePresenter, visible in the MediaSO class, that creates 
the remote presentation object of a product or other artifact 
in the server. To customize the generic MediaSO and MediaPO 



Fig. 5. HP Distributed Smalltalk 
and VisualWorks provide a full 
development and run-time 
environment for distributed 
computing. 

classes and the method updatePresenter for the encapsulation 
of specific artifacts we first created the narrower subclasses 
ArtifactSO and ArtifactPO. Then we augmented ArtifactSO with 
the attributes of artifacts such as vendor and product 
names. Finally, using overloading, we extended the method 
updatePresenter to include, among several other administra- 
tive tasks, the crucial behavior required for launching the 
artifacts while exporting their display to the client platform. 
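
The subclassing step can be sketched as follows. This is a Python illustration, not the Smalltalk source: the class names mirror MediaSO and ArtifactSO, but the method bodies, the dictionary that stands in for a presentation object, and the launch command are assumptions. What the text calls overloading is usually called overriding today: the subclass redefines the inherited method and calls the inherited version first.

```python
class MediaSO:
    """Stand-in for the generic semantic class; not the real HP class."""
    def __init__(self, name):
        self.name = name

    def update_presenter(self):
        # Generic behavior: describe the presentation object to create.
        return {"label": self.name}


class ArtifactSO(MediaSO):
    """Narrower subclass carrying artifact attributes such as vendor and product."""
    def __init__(self, name, vendor, product, command):
        super().__init__(name)
        self.vendor = vendor
        self.product = product
        self.command = command  # hypothetical launch command for the artifact

    def update_presenter(self):
        presenter = super().update_presenter()  # inherited behavior first
        # Extension: add what the client needs to identify and launch it.
        presenter.update(vendor=self.vendor, product=self.product,
                         launch=self.command)
        return presenter


demo = ArtifactSO("Product Overview", "ParcPlace Systems",
                  "VisualWorks 2.0", ["visualworks_demo"])
presenter = demo.update_presenter()
print(presenter["label"], "/", presenter["vendor"])
```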

Concurrent with this architectural design of the classes and 
methods that bring about encapsulation in the Software 
Solution Broker, a few product dependent steps must also 
be taken. This is done at the instance or object level of every 
concrete artifact (such as a product) so that it will behave in 
its expected, native mode. This is a simple matter of insert- 
ing the right environment variables and parameters in an 
encapsulation dialog window. The required information can 
easily be gleaned from the installation manual of the particu- 
lar product that is being encapsulated. Finally, products, 
tools, and other components are put into folders and the 
encapsulation is done. 
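
A record of the kind an administrator might fill in through the encapsulation dialog can be sketched as follows. This is a Python illustration under stated assumptions: the class EncapsulationParameters, its fields, and the sample command, path, and host names are hypothetical, and using the X Window System DISPLAY variable to send an artifact's windows to the client is an assumed mechanism, not one the article confirms.

```python
import os
import shlex
from dataclasses import dataclass, field


@dataclass
class EncapsulationParameters:
    """Hypothetical per-artifact launch settings taken from its manual."""
    command: str
    environment: dict = field(default_factory=dict)
    display: str = ""  # assumed: X display on which the windows should appear

    def launch_spec(self, base_env=None):
        """Return (argv, env) suitable for handing to a process launcher."""
        env = dict(os.environ if base_env is None else base_env)
        env.update(self.environment)  # product-specific variables
        if self.display:
            env["DISPLAY"] = self.display
        return shlex.split(self.command), env


params = EncapsulationParameters(
    command="demo_tool -config demo.cfg",
    environment={"TOOL_HOME": "/opt/demo_tool"},
    display="client1:0",
)
argv, env = params.launch_spec(base_env={"PATH": "/usr/bin"})
print(argv)
print(env["TOOL_HOME"], env["DISPLAY"])
```

The returned pair could be passed to a launcher such as subprocess.Popen(argv, env=env). The sketch stops short of starting a process so that it runs anywhere.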

Use of Object Technology 

The design and building of the Software Solution Broker 
were characterized by a short development time, a minimal 
amount of new coding, and a high degree of reuse. The major 
reason is the application of object-oriented technology. The 
object-oriented use is pervasive throughout the design, as 
indicated above, but it is helpful to point to specific exam- 
ples. We'll give two examples for the object-oriented fea- 
tures inheritance and polymorphism in the context of 
encapsulation. 

One of the examples has just been described: the subclass 
ArtifactSO of the class MediaSO inherited the method update- 
Presenter, which in turn, through the feature of polymorphism, 
was overloaded (that is, extended to include additional 
functional behavior). 

The encapsulation dialog window provides another example. 
As an administrative tool, it is not available to the user. It is 
an object built from a subclass of the existing HP Distributed 
Smalltalk class called SimpleDialog. From this class, the 
window inherits characteristics such as its property to pop up 
in front of other windows (it's not obscured), its basic layout 
HP Software Solution Broker Accessible Products 



Vendors 

• Cognos Corp. 

• ParcPlace System 

• XVT Software 

• Itasca System 

• Informix 

• Neuron Data 

• Sybase 

• Unison Software 

• ProtoSoft 

• Oracle 

• Dynasty 

• NetLabs 

Tools 

HP-UX* 

• Cognos Corp. 

o PowerHouse 4GL 7.23 

• ParcPlace System 
o VisualWorks 2.0 

o VisualWorks with Sybase connectivity 

• XVT Software 

o XVT-Design (C Developer Kits) 
o XVT/XM (C Developer Kits) 
o XVT-Power++ 

o XVT/XM (C++ Developer Kits) 
o XVT-PowerObject Pak I 

• Itasca System 

o ODBMS Server 
o Developer Tool Suite 
o C Interface 
o Lisp Interface 

o API Libraries (C++, CLOS, Ada) 

• Informix 

o Informix Online R4GL 

o Informix WingZ 

o Informix SE R4GL 

o Informix SE ISQL 

o Informix Hyperscript Tools 

o Informix Online ISQL 

• Neuron Data 

o Smart Elements (Nexpert Object) 
o Smart Elements (Openedit) 
o Open Interface Elements (Open edit) 
o CS Elements (Openedit) 

• Sybase 

o SA Companion (client & server) 
o SQL Monitor (client & server) 
o SQL Debugger inspector 
o SQL Debugger console 
o SQL Data Workbench 
o SQL APT Edit 
o SQR Workbench (Easy SQR) 
o Open Client/Server 
o ISQL/SQL Server 

• Unison Software 
o Maestro 

o Load Balancer 
o Express 
o RoadRunner 

• ProtoSoft 

o Paradigm Plus 

• Oracle 

• NetLabs 

o NetLabs/AssetManager 



o NetLabs/Vision 

o NetLabs/Assist 

o NetLabs/NerveCenter 

o NetLabs/Manager 

o NetLabs/OverLord Manager 

o NetLabs/Discovery 

MS Windows 

• Cognos Corp 

o PowerHouse Windows 1 2E 
o Axiant 
o Impromptu 
o PowerPlay 

• ParcPlace System 
o VisualWorks 2.0 

o VisualWorks with Sybase connectivity 

• XVT Software 

o XVT-Design (C Developer Kits) 
o XVT/Win (C Developer Kits) 
o XVT-Power++ 

o XVT/Win (C++ Developer Kits) 

o XVT-PowerObject Pak I for MS Windows 

• Itasca System 

o ODBMS Server 
o API Libraries (C++) 

• Informix 

o New Era 

• Neuron Data 

o Smart Elements (Nexpert Object) 
o Smart Elements (Openedit) 
o Open Interface Elements (Open edit) 
o CS Elements (Openedit) 

• Sybase 

o Net-Library 
o Open Client /C 
o SQL Monitor Client 
o SQR Workbench 
o APT Execute 

• ProtoSoft 

o Paradigm Plus 

• Oracle 

• Dynasty Technologies 
o Dynasty 

• NetLabs 

o NetLabs/Vision DeskTop 

Artifacts 

• Cognos Corporation 
o QUICK Application 
o QUIZ Application 
o POL Application 

o QDESIGN Application 

o QTP Application 

o QUTIL Application 

o PDL And Utilities Reference Manual 

o PowerHouse for UNIX - Primer 

• ParcPlace System 

o Product Overview 

• XVT Software 

o Product Overview 
o XVT Design Tutorial 
o XVT Database Demo 
o XVT-Power++ Overview 
o XVT-Power++ Demo Guide 
o XVT-Power++ Earth Demo 

• Itasca System 



• Informix 

o Product Overview 

o Informix R4GL Demo 

o Six demos with source codes 

o Informix ISQL Demo 

o Informix Hyperscript Demo 

• Neuron Data 

o Product Overview 

o Notepad Widget example with source files 

o Pack example with source files 

o Print widget example with source files 

o Resize widget example with source files 

o Resource Picker example with source files 

o Scripting example with source files 

o Scroll area usage example with source files 

o Scroll bar usage with source files 

o Sliders usage with source files 

o Special widget example with source files 

o String search example with source files 

o Text edit validation example with source files 

o Windows MD example with source files 

o Alert Windows example with source files 

o Browsex example with source files 

o Browsinc example with source files 

o Cbox example with source files 

o Chart example with source files 

o Clock Widget example with source files 

o C++ Notepad widget example with source files 

o Drag drop example with source files 

o Draw example with source files 

o DropDown pale example with source files 

o File manager example with source files 

o File name translator example with source files 

o File Picker example with source files 

o Floating window example with source files 

o Gantt chart example with source files 

o Help engine example with source files 

o Help viewer example with source files 

o ICON generator example with source files 

o List Box example with source files 

o Local drag drop example with source files 

o Menu example with source files 

o Multiple font text example and source code 

o Notepad example with source files 

• Sybase 

o SyBooks 
o APT Demo 

o Compute example with source files 
o Csr_disp example with source files 
o ilBn example with source files 
o blktxt example with source files 
o Five other examples with source files 

• Unison Software 

• ProtoSoft 

• Oracle 

• Dynasty 

• NetLabs 

HP-UX is based on and is compatible with Novell's UNIX* 
operating system. It also complies with X/Open's* XPG4, 
POSIX 1003.1, 1003.2, FIPS 151-1, and SVID2 interface 
specifications. 

UNIX is a registered trademark in the United States and other 
countries, licensed exclusively through X/Open Company 
Limited. 

X/Open is a trademark of X/Open Company Limited in the UK 
and other countries. 

MS Windows is a U.S. trademark of Microsoft Corporation. 
with a text box and OK and Cancel buttons, and its link to a 
value holder that holds the environmental variables, names, 
and other information needed for the encapsulation. The 
only method needed in addition to inherited ones is the one 
requesting the encapsulation parameters mentioned above. 

The same procedure, that is, the use of predefined classes 
and thus minimal coding, applies to HP Distributed Small- 
talk's folders containing the encapsulated product with its 
tools and other artifacts. The HP Distributed Smalltalk class 
FolderPO (PO indicates it is the folder class spawning 
presentation objects) has a method windowMenu, which creates a 
window with several pop-up menus that have labels such 
as Action, Edit, and so on. For a subclass of FolderPO called 
FolderPlusPO, these properties of windowMenu are inherited, but 
windowMenu is also changed (while keeping the same name) 
by the addition of a method artifactCreate and its label in one 
of the pop-up menus of windowMenu. The method artifactCreate 
is responsible for the inner workings of the encapsulation 
dialog window mentioned above. 
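
The way FolderPlusPO reuses and extends windowMenu can be sketched as follows. The sketch is in Python with the method names transliterated. The menu entries, the label "Create Artifact", and the body of artifact_create are assumptions made for illustration and do not reproduce the Smalltalk classes.

```python
class FolderPO:
    """Stand-in for the folder presentation class; menu contents are assumed."""
    def window_menu(self):
        return {"Action": ["Open", "Close"], "Edit": ["Cut", "Copy", "Paste"]}


class FolderPlusPO(FolderPO):
    """Keeps the inherited menus and adds one entry for creating artifacts."""
    def __init__(self):
        self.artifacts = []

    def window_menu(self):
        menus = super().window_menu()  # same name, inherited content
        menus["Action"] = menus["Action"] + ["Create Artifact"]
        return menus

    def artifact_create(self, parameters):
        # Stands in for opening the encapsulation dialog and storing its input.
        self.artifacts.append(dict(parameters))
        return len(self.artifacts)


folder = FolderPlusPO()
print(folder.window_menu()["Action"])
print(folder.artifact_create({"vendor": "Sybase", "product": "SQL Monitor"}))
```

The only new code is the extra menu entry and the method behind it; everything else about the folder window comes from the parent class.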

Development Methodology 

Funding for the Software Solution Broker project was subject 
to the condition that the development team find, justify, and 
implement a design that brings the tools to the consultants in 
the fastest possible way with the least amount of resources, 
including development, maintenance, and support resources. 
At the same time, every released version, even the very first 
one, had to find immediate user acceptance. Based on these 
stipulations the team chose a development method that is a 
hybrid of iterative prototyping and the Fusion method. 

Our reasons for favoring iterative prototyping over a classi- 
cal software design paradigm that starts with a complete 
specification (such as the so-called waterfall model) were: 

• Time constraints. There are never enough engineer-months 
to write a complete specification and then implement and 
test it to production strength.¹ 

• Constraints imposed by the intrinsic nature of the Software 
Solution Broker tool we were building, that is: 

o Client-side usability. The GUI that was eventually chosen 
is the result of repeated testing by potential users to 
achieve maximum ease of use and intuitiveness, and this 
amount of trial and error cannot be specified in advance. 

o Tool accessibility. The different products on the virtual 
shelf have different behaviors and their own requirements 
for resources and administration, and creating the 
encapsulation process again requires much experimentation and 
gradual maturation based on experience that cannot be 
specified a priori. 

o Using the object paradigm. The software substrate chosen 
(HP Distributed Smalltalk with VisualWorks) is well-suited 
for the rapid development of GUI and client/server 
applications. 

Based on these considerations, our overall approach was 
that of evolutionary prototyping, in which a fully functional 
prototype is ushered through repealed refinement sleps into 
a production-strength end product. We realize that often a 
prototype leads only to an executable specification or a vali- 
dated model, not a high-quality, stable product. However, in 
our case the sophisticated framework of HP Distributed
Smalltalk with its semantics/presentation split and Visual-
Works with its Model-View-Controller ensured full function-
ality and high quality at each refinement step because we
reused the existing, high-quality code (including the library 
of classes) and very sparingly added new, thoroughly tested 
code, preferably as instances (objects) of the existing class 
library. 

Fusion Method 

While iterative prototyping can be seen as a software devel- 
opment philosophy that is primarily dictated by business 
requirements such as time to market, break-even time, or 
optimal return on investment, the Fusion method² was de-
veloped with the goal of creating a language-independent,
comprehensive software project management method.
Being a systematic object-oriented development method, it 
blends well with our software substrate, which we chose 
based on openness, compliance with industry standards, 
ease of use, and the ability to separate the server (seman-
tics) from the remote clients (presentation). The Fusion
method emphasizes a modular design process in clearly de- 
marcated phases, so it synchronizes well with the iterative 
prototyping approach, which requires the repetition and 
refinement of certain development stages without impacting 
others. Furthermore, the Fusion method insists that a soft- 
ware development process of the complexity encountered 
today must cover the entire software development life cycle. 
The Fusion method's phased development process served as 
the blueprint for the Software Solution Broker. It can be 
summarized as follows² (our italics):

Starting from a requirements document, the analysis phase
produces a set of models that provide a declarative descrip-
tion of the required system behavior. The analysis models
provide high-level constraints from which the design models
are developed. The design phase produces a set of models
that realize the system behavior as a collection of interacting
objects. The implementation phase shows how to map the
design models onto implementation-language constructs.

In our hybrid approach we take an early, loosely defined 
functional prototype as our initial requirements definition 
(an executable specification), to be modified and refined in 
subsequent iterations through the three phases of analysis, 
design, and implementation. After each of these phases a 
review of the phase outputs is conducted by the develop- 
ment team in conjunction with users. The results of this 
audit are prioritized and, if deemed important, incorporated 
into the prototype which, through several of such review 
loops, evolves after a full cycle into the production product.
(For details about the outputs mentioned and the complete 
Fusion process breakdown see reference 2, especially 
Appendix A.) 

In summary, the two complementary methods of iterative 
prototyping and Fusion serve two main purposes. First, at 
the end of each prototyping cycle a fully functional produc- 
tion-strength product is released. Second, the three Fusion 
phases — analysis, design, and implementation — of every 
cycle are independent of the phases in another cycle. There-
fore, we are in effect working towards several releases at
the same time (see Fig. 6).

Fig. 6. Software Solution Broker development used iterative proto-
typing and the Fusion method, resulting in parallel development
cycles.

Customizing the Software Solution Broker 

In addition to being a productivity tool and a hub of product 
expertise for HP's technical consultants, the Broker can be 
customized to meet the business needs of end customers as 
well. To sketch how such a customization can be done using
the object-oriented framework of HP Distributed Smalltalk,
imagine a vendor of CAD (computer-aided design) software.
Rather than offering shrink-wrapped software packages on
the shelves of the store, the retailer wants to offer customers
an environment where they can, by navigating through vir-
tual shelves, choose interesting products and "test drive"
them in the store before deciding what to buy.

For an end customer such as the CAD software vendor, the
Broker can be customized by mapping the particular cus-
tomer requirements into several levels of design complexity.
These levels describe in technical terms what level of inter-
vention into the framework of HP Distributed Smalltalk is
needed to alter and customize the existing classes and meth-
ods. On the lowest level, the requirements fit the HP Distrib-
uted Smalltalk framework exactly, and the system can be
built from existing classes without change. A higher level of
intervention would be needed to construct the Software
Solution Broker for the CAD software vendor. Slight modifi-
cations of core services (relating to containment and life
cycle semantics), in addition to class augmentation and
overloading of methods, would be recommended. Working
with predefined, well-documented levels of intervention that
are necessary to meet a customer's requirements has the
advantage of communicating to the customer in advance,
during the analysis and before system design begins, how
much reuse of the framework is possible and how much
nonframework augmentation is necessary. Intervention lev-
els are thus not only technical assessments but also indica-
tors of the final costs for the system.



Conclusion 

The Software Solution Broker was not a typical client/server 
application development project. We were not primarily con-
cerned about two-tier or three-tier architectures, about ob-
jects per se, about the one "right" programming language, or
about coding. In fact, we went the opposite route. Based on 
the working requirements of HP's technical consultants and 
our own analysis of how consultants work with customers, 
we resolved to translate these requirements into a system 
built from distributed objects. The building, however, con- 
sisted mainly in the skillful choice of existing classes and 
the exploitation of HP Distributed Smalltalk's framework. 
The novelty in our approach lies not in the coding of new 
structures, but in the extensive application of reuse. In fact,
whenever new code seemed required, we took it as a warning 
that further analysis was needed to look for prefabricated 
code within the framework of HP Distributed Smalltalk. 
This simple principle, essential for a fast time to market, 
also guaranteed a short turnaround time and high quality. 

Through its first two releases, 1.0 and 2.0, the Software Solu-
tion Broker can be viewed as a distributed productivity tool
offering three overlapping types of services. These three 
types can be described metaphorically as a virtual software 
shop for the display of individual products, a consultative 
workbench or simulated classroom for studying and experi- 
menting with several collaborating products, and a virtual 
demo center with remote satellite offices where the techni- 
cal consultants can build prototypes and create demos for a 
customer. Looked at from a broader perspective, however, 
the Software Solution Broker architecture and implementa-
tion are, with small customization, also ideal for other, re-
lated applications that require one (or a few) persistent cen-
ters and many locally distributed and individually presented
clients. One example is software distribution. Another is the 
establishment of a worldwide software application develop- 
ment lab where each satellite group can develop its own 
part locally, check it in with a central repository where it is
available to the other satellites, and participate remotely in
the integration of the parts into a system. Furthermore, ob- 
ject technology, with its concept of containers, makes avail- 
able compound documents (text, picture, voice, video, etc.) 
that can be employed also on the nontechnical side of busi- 
ness as vehicles for elaborate project proposals and other 
communication with business customers — for instance, to 
propose a solution by showing a video of a prior, successful 
installation (this would take the place of a paper document
of reference sites). In this role, the Software Solution Broker
can be a worldwide business solutions exhibit and a conve-
nient repository for a portfolio of repeatable solutions from
which the customer, advised by a consultant, can select 
products the way we now choose from mail-order catalogs. 

Bugs in Black and White: Imaging IC
Logic Levels with Voltage Contrast 



Voltage contrast imaging allows visual tracking of logical level problems 
to their source on operating integrated circuits, using a scanning electron 
microscope. This paper presents an overview of voltage contrast and the 
methods developed to image the failure of dynamic circuits in the 
floating-point coprocessor circuitry of the HP PA 7100LC processor chip. 

by Jack D. Benzel 



As pressure for higher performance and higher integration
drives integrated circuit design towards increasing complex-
ity, IC designers need an ever-broadening set of analysis and
debugging tools and methodologies for tracking down func-
tional bugs and electrical margin issues in their designs.

In developing the new HP PA 7100LC PA-RISC micropro-
cessor chip, the floating-point arithmetic logic unit (FPALU)
megacell used design techniques based on the PA 7100 de-
sign.¹ The FPALU design is implemented with mostly
mousetrap-style dynamic logic² with significant use of
single-ended dynamic logic in the last pipeline stage.

Past experience in debugging electrical problems in mouse-
trap designs has shown these problems to be very difficult
to find.³ A failure mechanism that emerged in prototypes of
gate-biased PA 7100LC FPALUs proved highly challenging
and evasive and required a large engineering effort to get
from detection to root cause identification. The voltage
contrast imaging methodology proved useful in analyzing
and later confirming the root cause of the failure mecha-
nism. Results from the analysis allowed us to correct the
design and verify its quality.

The Wall 

The FPALU failure mechanism was named "the wall" because 
of its appearance on a frequency-versus-voltage shmoo plot 
depicting regions of passing and failing vectors (see Fig. 1).

Considerable engineering resources were applied toward 
finding the root cause of the wall using many of the tech- 
niques that had proved successful on previous design 
projects, including but not limited to shmoo plots, failing 
vector/opcode analysis, clock phase stretching, focused ion 
beam (FIB) experiments, and simulations of probable circuit
failures.³ These techniques were not providing enough infor-
mation, and a new methodology was clearly needed.

Why Voltage Contrast? 

Another HP design team had recently had success in using 
an electron-beam prober⁴ to track down the root cause of a
noise problem on the same CPU chip.

Previous experience with another project several years ago 
provided insights into a methodology similar to electron-
beam probing called voltage contrast, using a scanning
electron microscope (SEM). After considering the various
tradeoffs it was decided to proceed with the voltage contrast 
imaging while keeping open the option of going to electron- 
beam probing if further analysis was required. 

SEM Fundamentals 

The SEM displays objects by sensing and imaging the release 
of secondary electrons from the surface of a sample which 
is held in a very high vacuum. A finely focused beam of elec- 
trons accelerated from an electron gun with a thousand-volt 
potential is swept over the surface of the sample in much 
the same way that a television screen is scanned. As the 
high-energy electrons in the beam strike the sample, several
valence electrons will be "knocked loose" from the sample 
as the impinging electrons lose energy. These now-free elec- 
trons, or secondary electrons, find their way to the surface 
of the sample and are released from the surface. A highly- 
biased metal screen situated near the sample collects escap- 
ing secondary electrons into a detector which generates a 
signal proportional to the number of electrons collected. 
The signal from the detector is amplified and displayed on a 
CRT screen which is scanned in synchronization with the 
electron beam sweeping the sample. 

Fig. 1. Shmoo plot of "the wall."

Fig. 2. (a) First pass "charge" of IC surface with secondary electrons.
(b) Second pass "read" of charged surface (bottom), resulting video
signal (middle), and 2D video image (top).

Voltage Contrast Imaging

Voltage contrast imaging uses the electrical nature of the
SEM to view voltage potentials on a sample changing with
time. Figs. 2a and 2b show a cross section of the top two
metal signal layers of an IC with the metal lines insulated by
an oxide.

The imaging is done in two stages: charging and reading.
Fig. 2a shows the state of the IC at the end of the charging stage.
The positive potential of the buried metal lines attracts and
holds the generated secondary electrons on the surface of
the oxide above the metal lines. These charges will remain
on the surface for long periods of time, basically acting like
a capacitor.

Fig. 2b shows the state of the IC at the end of the read stage
with the voltage potentials of the metal lines now changed.
The resulting detector signal level and the CRT image gener-
ated from it are also shown above the cross section. As the
electron beam sweeps the surface of the sample, the elec-
trons that were once held by the positive charge of the upper-
left and lower metal lines (Fig. 2a) are knocked off the sur-
face and are collected into the detector, generating a bright
signal on the CRT. On the other hand, the upper-right metal
line is now more positive, and the surface above it will re-
lease fewer secondary electrons as the surface capacitively
charges, corresponding to a lower number of electrons
collected and thus a darker signal on the CRT.




Fig. 3. Video image of DUT fixture for voltage contrast setup with
top shield removed.

DUT Preparation 

Preparing the IC for the SEM environment required careful
attention to several details as follows:

• Clean Power Environment. Some previous experiments
indicated that the wall was somewhat remedied by a power
environment that restricted the VDD current supply. There-
fore, careful attention was paid to provide adequate low-
inductance power feeds with adequate decoupling
capacitance.

• Simple Vector Stimulus. Restricted cabling into the SEM
chamber and easy portability between two different SEM
facilities required a simple method for executing a wall-
sensitive floating-point operation (FLOP). A successful
method was developed to launch and step through the
phases of a FLOP using the JTAG†-conforming serial test
port and a serial test board.

• Image Capture Synchronization. The capture and imaging
of events on the SEM system requires a synchronizing signal
generated by the device under test (DUT). Several small
surface mount ICs were mounted on the PA 7100LC package
to decode the clock signals and derive another synchroniz-
ing signal to provide the SEM with an accurate sync pulse
that identified the leading clock edge at the starting phase
of the failing FLOP.

• Minimize Outgassing. To achieve an adequate vacuum in
the SEM system, materials that had minimal outgassing were
required. This prevented the use of heat-shrink tubing and
quick-cure epoxies and required careful cleaning of the DUT.

• Packaging. The packaging fixture containing the CPU (see
Fig. 3) met several requirements. The wall was a high-
temperature phenomenon and required heating the part
inside of the SEM with large resistors mounted inside the
fixture. The metal enclosure shielded all but the die surface
from the electron beam, since the beam will positively
charge plastics (wiring, capacitors). The shield also pre-
vented electrical signals in the DUT wiring from interfering
with the beam's trajectory. The last requirement filled by the
fixturing was a compact size to fit inside the small SEM
chamber.

† JTAG is the Joint Test Action Group, which developed IEEE Standard 1149.1, IEEE Test
Access Port and Boundary-Scan Architecture.

Fig. 4. Beam blanking and synchronization signal generation.

Imaging Dynamic Signals

The electron beam scan is synchronized with the scan of the 
video display tube and consequently has a slow refresh rate
of 1/60 second. This slow refresh rate works well for station- 
ary objects and static electrical signals, but the signals of 
interest involved in the wall failure are dynamic, typical of 
mousetrap designs. The imaging of dynamic signals required
the development of a new process. 

Synchronization and Beam Blanking 

The slowest rate at which the DUT could be clocked with
reliable operation of the scan path driven through the JTAG
port was 2.5 MHz, giving a 200-ns phase or period during
which dynamic signals would be active. Connecting a pulse
generator to the DUT's sync pulse allowed the generation of
a variable-width, variable-delay pulse (see Fig. 4) which was
used to blank the electron beam scanning the DUT. Using
this blanking signal, the SEM could be controlled to charge
or read the IC only during the time of interest when the wall-
related signals were active. A 100-ns sample window was
chosen for the blank signal, which was centered in the clock
phase to reduce possible overlap into adjoining phases.
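
The timing arithmetic behind the blanking pulse can be sketched in a few lines of Python. This is only an illustration, not part of the original setup: it assumes a 2.5-MHz clock with two equal phases per cycle (which is what yields a 200-ns phase), and the function and parameter names are hypothetical.

```python
def blanking_window(clock_hz: float, window_ns: float, phases_per_cycle: int = 2):
    """Return (phase_ns, delay_ns) for a sample window centered in one clock phase.

    phases_per_cycle=2 is an assumption: a two-phase clock, each phase
    lasting half of the clock period.
    """
    period_ns = 1e9 / clock_hz                 # full clock period in ns
    phase_ns = period_ns / phases_per_cycle    # duration of one phase
    if window_ns > phase_ns:
        raise ValueError("sample window is longer than the clock phase")
    # Centering the window leaves equal guard time on both sides of it.
    delay_ns = (phase_ns - window_ns) / 2
    return phase_ns, delay_ns


phase_ns, delay_ns = blanking_window(clock_hz=2.5e6, window_ns=100.0)
print(f"phase = {phase_ns} ns, window opens {delay_ns} ns after the sync edge")
```

Under these assumptions the 100-ns window opens 50 ns after the leading edge of the phase and closes 50 ns before its end, which is the guard time that reduces overlap into the adjoining phases.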

Once the beam was properly synchronized and blanked, the 
apparent lack of information in the video image shown in 
Fig. 5 gave a strong indication that more development was
needed. 

Image Capture 

The next problem to resolve was imaging the brief 100-ns 
video information successfully. Several ideas were evaluated 
and tried before an acceptable method was found: 

• Photographic Film Integration. The SEM focuses the light
from a secondary CRT onto the film plane of a Polaroid
camera over a period of several minutes while exercising
the DUT. This method resulted in either completely black or
very indistinct images of the IC.

• Two-Dimensional Scan. The SEM can operate with basically
a zero-frequency vertical scan rate. This provides an image of
a single horizontal slice of the IC surface while improving the
refresh rate. Changes in beam intensity were indiscernible
in this mode.



• Two-Dimensional Scan in Oscilloscope Mode. Using the
same two-dimensional scan mode as above, the intensity
vector of the SEM's display can be used to drive the vertical
component of the video signal. The resulting image is remi-
niscent of an oscilloscope display showing intensity on the
y axis. No discernible changes in intensity were visible in
this mode either.

• Two-Step Charge/Read. Instead of trying to charge and read
on each or every other FLOP, the process was broken into
two steps. The first step involved turning the beam on only
during the phase of interest while the part was executing
wall FLOPs over a period of three minutes. A long integra-
tion time was required because each time the beam turned
on it only charged a tiny area of the field of view. At the end
of the integration time, the beam was turned off, the IC
powered down, and the beam blank removed from the SEM.
The IC now had a surface charge that reflected the state of
the metal lines during the phase of interest. The second step
was to turn the beam on with no blanking to read the sur-
face charge in its first pass over the IC. The resulting video
image was clear but brief (one video frame). This process
produced an image in which metal lines with a positive volt-
age were white and metal lines at ground were black. An-
other small variation in this process was not to power down
the part before the read step. The resulting image took a
little more thought to interpret because only the metal lines
that changed state from the previous step were black or
white.

• Two-Step Charge/Read with VCR Frame Capture. By add-
ing a VCR to the setup, the resulting video image fed to the
CRT could be captured on tape and then freeze-framed for
viewing. The purchase of a VCR with a forward and reverse
single-frame jog shuttle control greatly aided in isolating the
image captured on a single frame. It was apparent from the
videotape that the majority of the IC's surface charge was
removed in the first sweep of the beam across the die area.
This last methodology was used successfully for imaging
the dynamic signals in the FPALU.
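
How the two variants of the read step are interpreted can be illustrated with a toy model in Python. This is a simplified sketch rather than a physical simulation: the line names, the HIGH/LOW encoding, and the "unchanged" label for lines that show no contrast are assumptions introduced here for illustration.

```python
from typing import Dict

HIGH, LOW = 1, 0  # logic level of a metal line (hypothetical encoding)


def read_powered_down(charged: Dict[str, int]) -> Dict[str, str]:
    """Read step with the IC powered down.

    Lines that were at a positive voltage during the charge step appear
    white, and lines that were at ground appear black.
    """
    return {name: "white" if level == HIGH else "black"
            for name, level in charged.items()}


def read_powered(charged: Dict[str, int], at_read: Dict[str, int]) -> Dict[str, str]:
    """Read step with the IC still powered.

    Only lines whose state changed since the charge step show contrast:
    a line that fell releases its held electrons (white), and a line
    that rose is now more positive and releases fewer electrons (black).
    """
    shades = {}
    for name, before in charged.items():
        after = at_read[name]
        if after == before:
            shades[name] = "unchanged"   # assumed label for "no contrast"
        elif before == HIGH:
            shades[name] = "white"       # fell from HIGH to LOW
        else:
            shades[name] = "black"       # rose from LOW to HIGH
    return shades


# Hypothetical state during the phase of interest: input low, output high.
charged = {"buffer_in": LOW, "buffer_out": HIGH}
print(read_powered_down(charged))
print(read_powered(charged, {"buffer_in": LOW, "buffer_out": LOW}))
```

The powered-down read gives the direct picture of the captured phase, which is why it is the easier of the two images to interpret.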

Results 

Once the methodology was established, over 120 images
were captured and catalogued on videotape over a four-
week period. Several days were spent at the outset trying to
understand why an active clock line in the imaged phase
was not showing activity, a key indicator that the proper
phase of the FLOP was being captured. This issue was never
satisfactorily resolved, yet phase-by-phase clock gating in
the FPALU ensured that the signals would only be active
and thus visible in the phase of interest.

Fig. 5. Video image of the first-pass imaging attempts.

Fig. 6. Ze'ojopH mousetrap buffer.

Figs. 6, 7a, and 7b show the schematic, artwork, and voltage
contrast image of probably the clearest failure identified. 
The circuit in Fig. 6 shows a mousetrap buffer whose stor- 
age node, si, was somehow being compromised, possibly 
through a ground differential problem or a noise spike on 
the input. 








Circle A in Fig. 7a identifies the buffer's input on the left and
the output on the right. The expected value of each metal 3
line is indicated above the lines (X=Low, H=High).

Fig. 7b shows a voltage contrast image captured from the 
videotape showing the failure of the buffer. The image 
clearly shows a low level on the input (black) and a high
level (white) on the output of the buffer in circle A. Note the 
difference between circle A and circle B which identifies the 
input and output of an identical buffer with no failures. It 
became clear from this picture that the electrical event that 
caused the buffer to output a high level was transitory in 
nature and not a static event. The read step of the image 
was taken with the IC powered down. 

Metal 1 and even metal 2 lines can be difficult to image unless 
they are well-isolated from other metal structures. Fig. 8a 
shows the artwork and expected values where several metal 
1 lines were imaged. The vertical metal 1 route in circle A 
should have a high or white level, and the route to the right 
of it in circle B should have a low or black level. 

Fig. 8b is the voltage contrast image showing the logical
misfiring (high/white) of the metal 1 route in circle B. This
failure was not seen until the root cause of the wall was
identified and the proper FLOP for arming the failure was
identified.

Fig. 7. (a) Metal 3 plot of ZerujupH buffer with failing input/output
pair A. (b) Voltage contrast image of victimized buffer with failing
input/output pair A.

Fig. 8. (a) (SIABCDI bus artwork. (b) Voltage contrast image of (SIABCDI
bus in metal 1 showing correct firing of the lines in circle A and the
incorrect firing of lines in circle B.

Fig. 9. (a) Metal 3 structure (vertical routing) in passing state at nominal voltage. Horizontal routes are metal 2. (b) Metal 3 structure in
failing state at high voltage with wall failure.

The logical states of individual lines of dense bus structures
in lower metal levels can be difficult to discern, yet differ- 
ences between two states can often be readily identified. 

Figs. 9a and 9b illustrate the differencing technique with an
example of a metal 3 structure in both a passing and failing
state (note the differences in the vertically routed lines in
the top-center of the figures). The bend or distortion in Fig.
9b is the result of poor synchronization between the SEM
and the VCR that recorded the images. Note also the
changes in the horizontally routed metal 2 lines.

One technique that greatly aided the interpretation of the 
captured images was to plot the artwork of the areas being 
imaged and annotate the plots with the expected logical 
levels as derived from a simulator. 
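
The comparison of annotated artwork against a captured image amounts to a simple lookup, sketched here in Python. The sketch assumes the powered-down read convention (white for high, black for low); the line names and level values are hypothetical and are not taken from the actual images.

```python
from typing import Dict, List

# Shade-to-level convention for a powered-down read step.
SHADE_TO_LEVEL = {"white": "H", "black": "L"}


def find_mismatches(expected: Dict[str, str], observed: Dict[str, str]) -> List[str]:
    """Return the names of lines whose imaged level differs from simulation.

    expected maps line name -> "H" or "L" (from a simulator);
    observed maps line name -> "white" or "black" (from the image).
    Lines that could not be discerned are simply left out of observed.
    """
    return sorted(
        name
        for name, level in expected.items()
        if name in observed and SHADE_TO_LEVEL[observed[name]] != level
    )


# Hypothetical example: two identical buffers with low inputs.
expected = {"A_in": "L", "A_out": "L", "B_in": "L", "B_out": "L"}
observed = {"A_in": "black", "A_out": "white", "B_in": "black", "B_out": "black"}
print(find_mismatches(expected, observed))
```

In this made-up example only the output of buffer A is flagged, the same kind of discrepancy that the annotated plots make visible at a glance.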

Improvements and Future Use 

It is difficult to determine if E-beam probing would have 
provided quicker, more pertinent information than voltage 
contrast. Each tool has its own benefits and drawbacks that
the IC designer must weigh in light of the problem to be 
solved. 

Additional IC physical structures and layouts could make
new designs more amenable to voltage contrast imaging as
well as E-beam probing and FIB experiments. These features
could provide regular, systematic, top-level-metal access to
control and data path signals throughout the design. Top-
level-metal access could be provided through directed routing
or through "via stacks" to top layers from lower-level metal
routes. The efficiency of such features in terms of improved
accessibility versus increased layout area is unknown.



The image quality obtained from the SEM for voltage contrast
work could be improved by changing the electron gun fila-
ment from tungsten to a crystalline element. The crystalline
filament would increase the beam current and thus effec-
tively provide a brighter image without increasing the beam
energy, which reduces resolution.

Conclusions 

Voltage contrast imaging proved to be a useful
tool for analyzing and verifying the FPALU margin failure
known as the wall. Although the information gleaned from
the process did not lead directly to the discovery of the root
cause of the failure, the voltage contrast process functioned
well as a clue generator as suggested in reference 3 and pro-
vided important confirmation of the root cause hypothesis.

Component and System Level Design-
for-Testability Features Implemented
in a Family of Workstation Products

Faced with testing over twenty new ASIC components going into four 
different workstation and multiuser computer models, designers formed a 
team that developed a common system-level design-for-testability (DFT) 
architecture so that subsystem parts could be shared without affecting 
the manufacturing test flow. 

by Bulent I. Dervisoglu and Michael Ricchetti



Members of the latest-generation family of HP workstation
and multiuser computer products use the same system archi- 
tecture and differ mostly in their I/O subsystem architecture 
and configuration. From a system development point of view 
an important characteristic of these products is their use of 
a new high-speed system bus architecture and a large num-
ber (over 20) of new ASIC components that were developed
to implement all of the various different configurations of 
the product line. Furthermore, all components that interface 
with each other via the system bus are required to operate 
with the same high-frequency system clock. 

A further difficulty was that four different models, ranging
from a single-user desktop workstation to a multiuser com-
puter, were being developed by different design teams that
were both organizationally and geographically separated
from each other. This made it necessary to develop a com-
mon system-level design-for-testability (DFT) architecture to
be used throughout the system and across the different com-
puter models so that subsystem parts could be shared
among the different computer models without affecting the
manufacturing test flow.

To address these difficulties a DFT core team was formed at
the very early stages of the project. Because of the large
number of different ASIC teams involved, it was decided
that all ASIC teams at the same site would be represented by
a single representative on the DFT core team. This team has
been instrumental in achieving goal congruence among the
different design teams and manufacturing organizations.
Furthermore, the presence of the DFT core team made it
possible to develop and implement a DFT methodology that
was used by all of the ASIC teams, although the level of ad-
herence varied. The DFT core team also collected data and
performed DFT design reviews for some of the ASICs.

ASIC DFT Design Rules and Guidelines 

One of the first activities of the DFT core team was to de-
velop a set of design rules and guidelines to be followed by
the ASIC design teams to ensure that DFT features would be
common among the various components. This made it pos-
sible to share efforts and results and to access the different
DFT features in the ASICs during prototype system bring-up.
The following is a summary of these rules.¹

1. All (functional) system clocks must be directly control-
lable from the chip pins and must not be used for any other
function. All systems use a common ASIC component (the
system clock controller ASIC) to drive their clock terminals
on the system board. This ASIC has control pins through
which it can be programmed for different clock generation
schemes as well as for stalling and halting the system clocks.
Thus, not only the individual ASICs but also the entire system
board has directly controllable clocks.

2. All scan and test clocks must be directly controllable
from the component pins, which must not be used for any
other purpose. On the system board all test clocks are tied
together and controlled from a single test point.

3. For each ASIC there is a specific reset state which is
entered when the component's ARESET_L signal is asserted.
On the system board, the power-on condition is detected
and is used to reset the ASICs to a known starting state.
Next, the memory controller ASIC generates an SRESET_L
signal to all other components on the system bus. Additional
reset signals are generated by other ASICs for use locally.

4. All ASICs must implement a dedicated boundary scan
register and its associated test access port (TAP) as specified
in IEEE 1149.1 Standard Test Access Port and Boundary-
Scan Architecture.² Serial scan-in and scan-out ports of all
ASICs in the system (including the PA 7200 processor, which
is on a separate module) are connected to form a single
serial scan chain.

5. Access to each ASIC's on-chip test functions must be pro-
vided using the IEEE 1149.1 test access port (TAP) protocol.
The same TAP controller design³ is used or heavily lever-
aged in many ASICs. This way, test features implemented in
this controller as an extension to the IEEE 1149.1 standard
were easily leveraged across different ASICs. For example,
the DRIVE_INHIBIT/DRIVE_ENABLE instructions and the OUT_OFF
bit in the boundary scan register (see "TAP/SAP Controller,"
below) are duplicated in different ASICs in this way.

6. All ASICs shall be designed to support IDDQ testing
whenever this is not prevented by the technology used. In
most cases this requirement did not present any further de-
sign constraints or changes. In a few cases, an internal
IDDQ enable signal had to be used to disable active pull-up
and pull-down circuits. However, because of schedule and
cost considerations the PA 7200 processor chip does not
support IDDQ testing.

7. All ASICs shall implement internal scan for testing. The
percentage of internal nodes that are scannable shall be
kept as high as possible without sacrificing major chip
area or otherwise affecting the design methodology. For
most practical purposes all ASICs have implemented internal
scan for 100% or nearly 100% of all internal flip-flops.
However, because of design style and technology differences,
some portions of the PA 7200 processor chip are not
scannable.

8. There shall be no asynchronous logic implemented in the
ASICs. Lack of asynchronous logic is an important requirement
for many CAD tools for generating test vectors. Furthermore,
this rule is intended to prevent side effects caused
by changing the internal and external signals in arbitrary
sequence. The only exception to this rule is granted for the
reset signals, which are implemented to follow a carefully
planned system reset strategy.

The following sections describe some of the DFT features
that have been implemented in the ASICs. Not all features
are implemented in all ASICs. Among the various ASICs, the
memory controller stands out as the chip with the most
extensive DFT features.

TAP/SAP Controller 

Access to all on-chip DFT features is implemented through a
test controller block called the test access port/scan access
port (TAP/SAP). The test controller implements all of the
required instructions for the IEEE 1149.1 TAP controller as
well as an extensive set of public and private instructions
which are targeted mostly for internal testing of the ASIC.
Table I lists all of the TAP instructions that are implemented.
Among the public instructions that have been implemented
are the DRIVE_INHIBIT and DRIVE_ENABLE instructions which
are used to set and clear a latch in the system logic domain
(not considered part of the test logic).
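As background for the TAP protocol referred to above, the following is a minimal Python sketch of the 16-state TAP controller state machine defined by IEEE 1149.1, in which the next state depends only on the current state and the TMS value sampled at each test clock. The helper names (`walk`, `load_instruction_tms`) and the 8-bit opcode width used in the usage note are illustrative assumptions, not details taken from the ASICs described here.

```python
# Next-state table of the IEEE 1149.1 TAP controller:
# state -> (next state if TMS = 0, next state if TMS = 1)
TAP_NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def walk(state, tms_bits):
    """Apply one TMS value per test clock and return the final state."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state


def load_instruction_tms(opcode_bits):
    """TMS sequence (hypothetical helper) that starts in Run-Test/Idle,
    shifts an opcode of the given width through Shift-IR, and returns to
    Run-Test/Idle. The last opcode bit is shifted on the clock that
    leaves Shift-IR, hence opcode_bits - 1 clocks with TMS = 0."""
    return [1, 1, 0, 0] + [0] * (opcode_bits - 1) + [1, 1, 0]
```

For example, `walk(state, [1] * 5)` returns `"Test-Logic-Reset"` from any starting state, which is the property that lets a tester force the TAP into a known state without a dedicated reset pin, and `walk("Run-Test/Idle", load_instruction_tms(8))` ends back in `"Run-Test/Idle"`.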

System logic for all ASICs has been designed such that for
normal system operation (i.e., when test logic is not
controlling the I/O pins) the ASIC can drive out only if the
DRIVE_INHIBIT latch is cleared. Each ASIC uses its ARESET_L
input to clear the DRIVE_INHIBIT latch during power-up.
Whereas ARESET_L controls the DRIVE_INHIBIT latch only if the
TAP is in a reset state, explicit TAP instructions can be used
at other times to set or clear this latch. This scheme allows
in-circuit ATE programs to set the DRIVE_INHIBIT latch before
they terminate and reset the TAP without creating possible
board-level bus contention before removing electric power
from the board. Whereas the DRIVE_INHIBIT latch is considered
part of the on-chip system logic, it is implemented as
part of the TAP controller design so that ASIC designers
implementing normal system functions do not have to deal
with any of the issues surrounding the DRIVE_INHIBIT and
DRIVE_ENABLE operations.
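The latch behavior described in this section can be summarized with a small behavioral model. This is only a sketch in Python under stated assumptions: the class and method names are hypothetical, the latch is assumed to start in the inhibited state before any reset, and loading any TAP instruction is assumed to take the TAP out of its reset state.

```python
from dataclasses import dataclass


@dataclass
class DriveInhibitLatch:
    """Toy model of the DRIVE_INHIBIT latch (names are illustrative)."""
    inhibited: bool = True       # assumed state before any reset
    tap_in_reset: bool = True

    def assert_areset_l(self) -> None:
        # ARESET_L clears the latch only while the TAP is in its reset state.
        if self.tap_in_reset:
            self.inhibited = False

    def tap_instruction(self, name: str) -> None:
        # Assumption: loading an instruction means the TAP has left reset.
        self.tap_in_reset = False
        if name == "DRIVE_INHIBIT":
            self.inhibited = True
        elif name == "DRIVE_ENABLE":
            self.inhibited = False

    def reset_tap(self) -> None:
        # Resetting the TAP alone does not change the latch.
        self.tap_in_reset = True

    def may_drive(self) -> bool:
        # In normal operation the ASIC drives out only if the latch is clear.
        return not self.inhibited


latch = DriveInhibitLatch()
latch.assert_areset_l()                 # power-up: latch cleared
latch.tap_instruction("DRIVE_INHIBIT")  # ATE program sets the latch ...
latch.reset_tap()                       # ... then resets the TAP and exits
drivers_still_off = not latch.may_drive()
```

In this model the drivers stay off after the ATE program resets the TAP, which mirrors the contention-free power-down sequence described above; a later ARESET_L assertion re-enables them.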



Table I
TAP Instructions

Instruction        Drive I/O Pads       Scan Register

EXTEST             Boundary Register    Boundary
BYPASS             System Logic         Bypass
SAMPLE/PRELOAD     System Logic         Boundary
IDCODE             System Logic         ID Code
HI_Z               High-Impedance       Bypass
DRIVE_INHIBIT      Boundary Register    Bypass
DRIVE_ENABLE       System Logic         Bypass
SCAN_INTERNAL      System Logic         f(Mode)
CHIPTEST           High-Impedance       f(Mode)
INTEST             Boundary Register    Boundary
DR_SCAN            System Logic         f(Mode)
SELECT_MODE        Boundary Register    Mode
SET_MODE_BIT       Boundary Register    Mode
CLR_MODE_BIT       Boundary Register    Mode
ISAMPLE            System Logic         Bypass
ESAMPLE            System Logic         Bypass
DS_DRIVE           Boundary Register    Boundary
DS_RECEIVE         System Logic         Boundary


Other TAP instructions are used to set and clear bits of the
mode register to provide access to additional test features
such as IDDQ testing, double-strobe, and so on. It is also
possible to speed up internal scan operations by switching on the
parallel scan bit in the mode register. This feature enables
multiplexing of the chip's I/O pins to perform serial scan-in
and scan-out of the internal scan register by breaking it into
three independent sections which are scanned in parallel
together with the boundary register, which is always scanned
using the test data in and test data out pins of the TAP.
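The benefit of the parallel scan mode can be estimated with simple arithmetic. The Python sketch below assumes, purely for illustration, that in serial mode the internal and boundary registers form one chain, that the internal register splits into three nearly equal sections, and that the register sizes shown are hypothetical; none of these figures come from the ASICs themselves.

```python
import math


def shift_clocks_serial(internal_bits, boundary_bits):
    # Assumed arrangement: one chain holding internal and boundary bits.
    return internal_bits + boundary_bits


def shift_clocks_parallel(internal_bits, boundary_bits, sections=3):
    # All sections and the boundary register shift at the same time,
    # so the longest of them sets the number of test clocks per vector.
    longest_section = math.ceil(internal_bits / sections)
    return max(longest_section, boundary_bits)


# Hypothetical register sizes, chosen only to show the calculation.
serial = shift_clocks_serial(6000, 400)
parallel = shift_clocks_parallel(6000, 400)
```

With these made-up sizes the serial load takes 6400 test clocks per vector and the parallel load 2000, so the scan time per vector drops by roughly the number of sections as long as the boundary register is shorter than one section.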

CHIPTEST Instruction 

One of the major difficulties in implementing DFT in the
ASICs used for this project has resulted from a common
leveraged I/O pad design that contains nonscannable latches.
Furthermore, the bidirectional I/O cell implements an internal
bypass path to feed into the chip the same value that is being
driven onto the I/O pad by that chip. In effect, I/O pads
contain nonscannable pipeline stages that control both the
direction and the value of data on the I/O pad. Following a
recommendation from the DFT core team, the basic I/O cell
design was modified to allow data values received by the
on-chip system logic to be set up using the dedicated boundary
scan register. In addition, system logic output values can
be captured into the boundary scan register using the system
clock. These design changes were coupled with features
provided by the CHIPTEST instruction in the TAP controller to
streamline the internal testing of the ASICs. For example, all
internal logic of the memory subsystem ASICs (memory
controller, slave memory controller, and data multiplexer) is
tested by the following sequence:

1. Load the CHIPTEST opcode into the ASIC. 

2. Use test clocks to perform a parallel scan of the ASIC
internal nodes and the boundary register. At the end of the
scan-in process the newly scanned-in values are automatically
moved from the boundary register to the nonscannable
latches in the I/O cells.

3. Apply a single system clock to capture test results in
internal nodes and system logic output values in the boundary
register.

4. Repeat steps 2 and 3 for each new vector, overlapping the 
scan-in and scan-out operations. 

Since the CHIPTEST instruction drives the I/O pins to a
high-impedance state it is possible (indeed it is intended) to
execute these tests on a populated system board without fear of
creating board-level bus clashes during such testing.
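The overlap of scan-in and scan-out in step 4 of the sequence above can be sketched in Python. This is a simplified model under stated assumptions: the internal nodes and boundary register are lumped into one hypothetical scan chain, `toy_logic` is an invented stand-in for the chip's logic, and loading the CHIPTEST opcode (step 1) is not modeled.

```python
from collections import deque


def scan(chain, new_state):
    """Shift new_state in, one test clock per bit, and return the old
    contents that were shifted out during the same clocks."""
    old = []
    for bit in reversed(new_state):   # bit for the far end enters first
        old.append(chain.pop())       # bit at the scan-out end leaves
        chain.appendleft(bit)         # new bit enters at the scan-in end
    old.reverse()
    return old


def system_clock(chain, logic):
    """Step 3: a single system clock captures the logic response."""
    result = logic(list(chain))
    chain.clear()
    chain.extend(result)


def run_chiptest(vectors, logic, length):
    chain = deque([0] * length)
    responses = []
    for i, vector in enumerate(vectors):
        unloaded = scan(chain, vector)   # step 2, overlapped with unload
        if i > 0:
            responses.append(unloaded)   # result of the previous vector
        system_clock(chain, logic)       # step 3
    responses.append(scan(chain, [0] * length))  # unload the last result
    return responses


def toy_logic(state):
    # Hypothetical logic: each bit XORed with its neighbor (wrapping).
    n = len(state)
    return [state[i] ^ state[(i + 1) % n] for i in range(n)]


vectors = [[1, 0, 0, 1], [0, 1, 1, 1], [1, 1, 1, 1]]
responses = run_chiptest(vectors, toy_logic, 4)
```

Because each scan operation both loads the next vector and unloads the previous result, N vectors need N + 1 scan operations rather than 2N.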

BIST Implementation 

The memory controller ASIC incorporates several wide and
shallow register files that are used for queueing operations
within the data paths. The total number of storage elements
in the register files is quite large, so it was not practical to
make these storage elements scannable. Therefore, a built-in
storage test (BIST) approach was chosen to test the memory
controller data path register files.

The memory controller BIST implementation was developed 
with the following objectives: 

• Provide high coverage and short test times.

• Provide at-speed testing of the register file structures to
ensure that the memory controller ASIC works at the required
system clock frequency.

• Provide flexibility and programmability in the BIST logic to
allow alteration of the test sequence for debug and unforeseen
failure modes. In particular, the system bring-up and
debug plans provide a means for system-level scan access
to the state within the ASICs. Providing these features
allows read/write access to the nonscannable queue states
for prototype system debug.

• Provide for testability of the logic surrounding the register
files through added observation and control points at the
inputs and outputs of the register file blocks. This is intended
to support automatic test pattern generation (ATPG) tools
used to generate test vectors for the memory controller and
thus ensure high coverage of the standard cell control logic
for the queues.

The design of the BIST logic in the memory controller data
paths is based on previous work that was done for the PA
7100-based HP 9000 Model 710 workstation. For that product,
a structure-independent RAM BIST architecture that uses a
pseudoexhaustive test algorithm and signature analysis was
developed and was implemented in the I/O controller ASIC.
The structure-independent, pseudoexhaustive test algorithm
provides 99.9% fault coverage of typical RAM faults and can
provide 80% to 99.9% coverage of neighborhood pattern-sensitive
faults. It also allows the test time (number of read/write
accesses per memory address) to be varied according
to the desired fault coverage. BIST architectures for both
the present memory controller ASIC and the previous I/O
controller ASIC use a test algorithm similar to that described
by Ritter and Schwair. Using the system clock for BIST
execution, the RAM structure can be tested at the normal system
clock rate, thus providing at-speed testing of the RAM.
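The general idea of LFSR-generated patterns combined with signature analysis can be illustrated with a short Python sketch. It is deliberately not the pseudoexhaustive algorithm used in the ASICs: it assumes a simple two-pass test (a pattern, then its complement), an 8-bit word, an illustrative feedback polynomial, and a hypothetical `Ram` model with an optional single stuck-at cell.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1
POLY = 0x1D   # x^8 + x^4 + x^3 + x^2 + 1, chosen only for illustration


def lfsr_step(state):
    """Advance a Galois-style linear feedback shift register by one step."""
    msb = (state >> (WIDTH - 1)) & 1
    state = (state << 1) & MASK
    return state ^ POLY if msb else state


def misr_step(signature, data):
    """Multi-input signature register: LFSR step with data XORed in."""
    return lfsr_step(signature) ^ (data & MASK)


class Ram:
    """Hypothetical RAM model; stuck_at is (address, bit, value) or None."""

    def __init__(self, depth, stuck_at=None):
        self.cells = [0] * depth
        self.stuck_at = stuck_at

    def write(self, addr, data):
        self.cells[addr] = data & MASK

    def read(self, addr):
        data = self.cells[addr]
        if self.stuck_at and self.stuck_at[0] == addr:
            _, bit, value = self.stuck_at
            data = (data | (1 << bit)) if value else (data & ~(1 << bit))
        return data


def run_bist(ram, depth, seed=0x01):
    """Write LFSR patterns, read them back into the MISR, then repeat
    with complemented data so every cell holds both 0 and 1."""
    signature = 0
    for invert in (False, True):
        pattern = seed
        for addr in range(depth):
            ram.write(addr, pattern ^ (MASK if invert else 0))
            pattern = lfsr_step(pattern)
        for addr in range(depth):
            signature = misr_step(signature, ram.read(addr))
    return signature


good_signature = run_bist(Ram(16), 16)
faulty_signature = run_bist(Ram(16, stuck_at=(5, 3, 1)), 16)
```

In this sketch a single stuck-at cell corrupts exactly one read across the two passes, and because the MISR update is linear and invertible, the faulty signature necessarily differs from the fault-free one. Real tests with multiple errors can alias, which is one reason practical BIST schemes choose register widths and polynomials with care.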

A dual-port write/single-port read register file from the
present memory controller data path, with test structures that
provide both BIST and ATPG support similar to the previous
I/O controller BIST architecture, is shown in Fig. 1. The two
write ports, A and B, can both be addressed and written
independently. The single read port can also be addressed
and read independently of the A and B write ports. Thus,
two write operations and one read operation can all occur
simultaneously for one to three register locations, depending
on the A, B, and read port addresses.

Given the dual-ported design of the memory controller
register files, it was necessary to extend the previous I/O
controller BIST architecture to test a dual-ported RAM. This
meant that the memory controller BIST implementation
should be able to test not only the simultaneous dual-write
operations but also the various combinations of A/B write
and read operations to verify that the port interactions are
working correctly. For the dual-port register files in the
memory controller such interactions include an internal
bypass when the read address is the same as either of the A
or B write addresses and a B-port dominant write when the
A and B write addresses are equal. This dual-write BIST
algorithm is described in reference 6.
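The port interactions that the BIST must exercise can be stated precisely with a small behavioral model. The Python sketch below is an assumption-laden illustration: the class name and `cycle` interface are hypothetical, and the choice that the B data is bypassed when the read address matches both write addresses is inferred from the B-dominant write rule rather than stated in the text.

```python
class DualWriteRegisterFile:
    """Toy model of a two-write-port, one-read-port register file."""

    def __init__(self, depth):
        self.regs = [0] * depth

    def cycle(self, read_addr, a=None, b=None):
        """One clock. a and b are optional (address, data) writes.
        Returns the data seen on the read port in the same cycle."""
        read_data = self.regs[read_addr]
        # Internal bypass: a write to the address being read is
        # forwarded directly to the read port.
        if a is not None and a[0] == read_addr:
            read_data = a[1]
        if b is not None and b[0] == read_addr:
            read_data = b[1]          # assumed: B also wins the bypass
        # B-port dominant write: B is applied last, so it overrides A
        # when both ports address the same location.
        if a is not None:
            self.regs[a[0]] = a[1]
        if b is not None:
            self.regs[b[0]] = b[1]
        return read_data


rf = DualWriteRegisterFile(4)
first = rf.cycle(0, a=(1, 0xAA), b=(2, 0xBB))   # independent accesses
bypassed = rf.cycle(3, a=(3, 0x11))             # read sees the A write
collided = rf.cycle(0, a=(0, 0x22), b=(0, 0x33))
stored = rf.cycle(0)
```

A BIST sequence for such a structure has to visit each of these cases (independent accesses, read/write address match on either port, and A/B address collision) in addition to the ordinary cell tests.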

For the register file shown in Fig. 1 each of the BIST
structures, namely LFSR (linear feedback shift register), SHIFT,
COUNT, and MISR (multi-input signature register), is dedicated to
BIST. Each register file also has its own dedicated programmable
BIST control queue for sequencing the BIST algorithm.
The BIST_MODE signal enables the BIST functions and can be
controlled either by a pin on the chip or through other test
access logic such as an IEEE 1149.1 TAP controller or
P1149.2 SAP controller.2,7 All of the BIST registers are
implemented in standard cell blocks separate from the data path
register files. A detailed description of the memory controller
BIST implementation and operation, along with hardware
overhead and test coverage, can be found in reference 6.

Fig. 1. Dual-port register file with built-in storage test (BIST) and
automatic test pattern generation (ATPG) features. The inputs to
the central, embedded RAM structure are provided by multiplexing
between the normal system value and a BIST register, which is
implemented as a linear feedback shift register (LFSR). The output
multiplexer makes it possible to capture the outputs into a
multi-input signature register (MISR) and to send either the RAM
outputs or the MISR contents to the rest of the system.

Test Tools 

The following sections describe the tools and tests that were
used and developed to test the three memory subsystem
ASICs: the memory controller, the slave memory controller,
and the data multiplexer.

Addscan. Fig. 2 shows the flow for scan synthesis. All three
memory subsystem ASICs were designed using a
structured-custom design method.8 Each standard cell block used the
in-house Addscan tool for scan insertion and the scan links
between blocks were connected by hand in the top-level
netlist of the chip. The internal scan order of each block is
based on post-place-and-route information.

Test Vector Generation. The test vector generation flow is
shown in Fig. 3. The ATPG tools from Crosscheck, Inc. were
used to generate most scan-based vectors. Some vectors
were hand-generated. A gate-level netlist of the chip, prior to
Addscan scan insertion, is used to create an ATPG database
for vector generation. A similar database is used by designers
to do Timver static timing analysis. Timver, a timing analysis
tool from Aida, Inc., is used as a part of the test methodology
for two purposes. First, it allows the design to be
checked for hold violations on all paths to guarantee that
there will be no timing violations even if ATPG vectors
exercise nonfunctional paths in the design. Second, Timver
critical paths can be fed back into ATPG in the form of a
vector.tcp file to generate double-strobe path delay vectors.



The following test vector sets were created for each of the
memory subsystem ASICs:

• Continuity. Checks for opens and shorts among the ESD
protection diodes. Prepared manually.

• Ringtest. Uses serial "flush" speed (total scan path delay)
through the boundary scan register as a measure of the IC
process and verifies that the part is within the six-sigma
range. Generated manually in the form of a Cadence Verilog
body file.

• Dc. These tests use the boundary scan ring to drive out all
ones or zeros for dc parametric testing. Generated manually
in the form of a Verilog body file.

• Leakage and Tristate Testing. Places the ASIC into a
high-impedance state to allow testing the I/O pads for leakage.
Generated manually in the form of a Verilog body file.

• IDDQ. These vectors are generated by ATPG and are used to
perform static IDDQ test and measurement.

• TAP Tests. These are tests targeted at functional testing of
the TAP controller. Generated manually in the form of a
Verilog body file.

• Chiptest. These vectors are generated by ATPG to test the
core chip logic in from and out to the boundary scan ring
using the TAP CHIPTEST instruction. I/O pad logic is not fully
tested by chiptest vectors.

• Pintest. These vectors are generated by ATPG and will test
the remaining faults (primarily in the I/O pad logic) that are
not covered by the chiptest.

• Bus Holder. Further testing of the electrical characteristics
of the bidirectional I/O cells. Generated manually in the
form of a Verilog body file.

• BIST. BIST vectors are only generated on the memory
controller. These tests require only two scan vectors, one each
to set up the initialization and test passes for BIST. After
that, a burst of system clocks is applied to test the target
blocks at speed. These vectors are generated using a Perl
script to produce a .tst vector file.

• Double-Strobe. These vectors are generated by ATPG based
on Timver critical paths and are used to provide at-speed
testing of the ASIC.

• Ac Testing of I/O Paths. These are functional tests that test
the speed characteristics of critical I/O paths. Generated
manually only if testing, design review, and chip
characterization results indicate a concern.

• Process, Voltage, and Temperature (PVT) Block Test.
Generated manually. This group of tests applies only to the slave
memory controller chip, which uses a unique PVT block to
compensate for process, voltage, and temperature variations
in a particular I/O cell.

Vector and Test Logic Verification. Fig. 4 shows the flow for
test vector verification using Verilog or LSIM (a special
FET-level simulator). Test vectors from ATPG can be directly
converted using TSSI (a tool for test program generation
from TSSI, Inc.) into a command file format and verified
against a gate-level netlist in Verilog or a FET-level netlist
using LSIM. Alternatively, test vectors can be simulated
using a Verilog body file. A body file is a wrapper or test jig
that can either be a test vector set itself (hand-generated
functional tests) or can run scan-based ATPG vectors using
a scan and clock sequence.

The AT&T Tapdance tool was used for further verification
of the TAP logic before tape release of the ASICs. Tapdance
generates a set of IEEE 1149.1 compliance tests to verify
standard TAP functionality. The Tapdance vectors were
converted using Perl scripts† into a Verilog force file and
simulated on a gate-level netlist.

Tester Format Translation. Fig. 4 also shows the flow for
translation of vectors into a tester format. Using TSSI, vectors
were formatted directly to HP 82000 tester format. To get to
the Schlumberger S9000 tester, vectors were first formatted
to LSIM and then passed through SIMITS, a format converter
from Schlumberger.

Using a Verilog programming language interface that outputs
a TSSI simulation event format file dump, vectors can also
be translated from body files to one of the testers.

† Perl is a high-level programming language.

System DFT Features 

The new systems have been designed to provide a method
to access ASIC scan paths, both boundary and internal, at
the system level. This has two major purposes. First, it
provides a means of accessing the internal state of complex
VLSI components. This provides additional hardware state
information to designers that would typically be inaccessible
and can aid traditional prototype bring-up and debug
methods. Second, it provides the ability to do scan-based
testing of board and system interconnect and internal scan
testing of ASICs.

The following test and debug features are provided by system
scan access:

• Ability to halt the system clocks and interrogate the internal
scan state of the ASICs.

• Single-cycle debug of the system core by halting the system
clocks, interactive scanning of the internal state, and then
starting or cycling the system clocks.

• Board-level and system-level interconnect testing and
interactive debug using boundary scan. This includes testing
connectors between two boards where boundary scannable
buses cross the connector.

• Ability to test an ASIC while it is on the board using
boundary and internal scan. This may include double-strobe tests
and running on-chip BIST, if supported by the ASIC under
test.

As part of the overall DFT requirements, all ASICs implement
the IEEE 1149.1 Standard Test Access Port and Boundary-Scan
Architecture. This provides support for system-level
scan access. In addition, key debug support features are
incorporated into the system clock controller chip to allow
for halting and controlling the system clocks. Further
information on system clock controller features can be found in
reference 12.

Fig. 5 shows a diagram of the system-level scan access
hardware. The Texas Instruments PC-based Asset toolset is used
as the interface to system scan. The Asset PC is connected
to a scan adapter board via the Asset interface pod. The
scan adapter board then plugs onto the system board and
provides control of the system clock controller features, the
TAP controller and system logic reset, the clock halt triggers,
and the I/O device clock halt from the Asset software. The
scan paths in the system are configured as a single serial
scan chain with optional system boards implemented as
dynamic scan paths that can be configured in Asset.

The Asset scan tools provide the following capabilities for 
system scan access: 

• Interactive control of scan path data and TAP controller
instructions with scan-bit name mapping and packing and
unpacking of scan data.

• Macro scripting capabilities for combining several interactive
operations into a single macro command. Asset also accepts
serial vector format scan vectors for user-developed tests.

• Specification of system scan path configuration for dynamic
scan paths and optional boards, such as CPUs, memory
extender boards, and I/O.

• Scan path integrity testing and boundary scan interconnect
testing of intraboard and interboard nets.

• Control of system clock halt, single-cycle stepping, and
system and TAP reset.



Results and Conclusions 

The DFT techniques described above, which were championed
by the DFT core team, were implemented in several
different ASICs with varying degrees of adherence to the
DFT rules and methodology. In general, results obtained
during prototype chip debug have shown a direct correlation
between the level of DFT implementation and the rapidity of
test development, chip characterization, and root-cause
analysis. For example, while the three memory subsystem
ASICs were the last to reach tape release, these chips were
the first to reach the operational test release (OTR) and
release to manufacturing test (RTPT) checkpoints. The
availability of high-quality and comprehensive test sets for these
chips enabled chip characterization efforts to be started
right away. Furthermore, success in reaching the OTR
checkpoint made it possible to transfer the task of testing
prototype chips (which are used in the prototype systems)
to the manufacturing engineers. This had a very positive
effect on resources available to perform chip characterization.
In turn, successful completion of this step coupled with
efforts of the R&D engineers to improve test coverage
enabled the team to reach the RTPT milestone well before any
of the other ASICs had reached their OTR checkpoints.

The Asset tool and its customized extensions provided a low- 
cost system scan access solution with flexible functionality 
and ease of use. As a commercial tool solution it cut down on 
development and maintenance costs compared to developing 
a proprietary toolset and can be reused for future projects.

fornia. He was the project manager for development of
the Software Solution Broker reported in this issue and is
one of its coarchitects. He was also responsible for the
negotiation and procurement of third-party product licenses
needed for various Software Solution Broker product
features. He received a BS degree in computer science from
the California Polytechnic State University in 1983. His
previous accomplishments at HP include work as an
architecture team member for an object-oriented manufacturing
application and work on an object-oriented text management
system. Before coming to HP, Manny was with U.S. Sprint as
a senior software engineer and project manager and with
Memorex Corporation as a software engineer. He is
interested in object-oriented development and methodologies
and distributed client/server applications. In his free time,
he enjoys gardening.

Adel Ghoneimy

Adel Ghoneimy is a freelance software consultant and has
been working with HP's Professional Services Division since
1993. He is responsible for the architecture and the ongoing
development of the Software Solution Broker reported in
this issue. He holds a BS degree in computer science
and automatic control from Alexandria University,
Egypt and has an MS degree in computer science
from the University of Minnesota. Before becoming
an HP contractor, Adel was with Digital Equipment
Corporation where he led the development of a workflow
system and with Honeywell Corporation where
he managed large software development efforts and
contributed to several projects in the area of distributed
systems applied research and advanced development.
At Honeywell, he worked on problems of
systems engineering for distributed automation and
control systems. He has published applied research
articles in the areas of workflow, object management,
and instrumentation for distributed applications.
His work has produced two patents concerning
long-lived transactions and workflow management
systems. He is interested in business process engineering,
object-oriented development and methodologies,
distributed client/server applications, and
transactional workflow management systems. He was
born in Cairo, Egypt and enjoys ballroom dancing.

Wulf Rehder

Wulf Rehder came to HP's Systems Technology Division
in 1986 and is now the curriculum manager for object
technology with HP's Professional Services Division. He
received a BS degree in mathematics and physics
from Hamburg University in 1969, an MS in mathematics
and statistics from Dortmund University in 1972, and a PhD
in mathematics from Berlin Technical University in 1978.
He is currently responsible for the development and
deployment of customer education courses in object
technology. He participated in the discussions of the
development team for the Software Solution Broker
reported in this issue. His previous contributions at
HP include simulation and modeling for PA-RISC computers,
work as a project manager at the HP Laboratories
Pisa Science Center in Italy, and the creation of
test algorithms for field programmable gate arrays at
HP Laboratories. Before joining HP, he was a performance
manager at Metaphor Computer Systems and
a mathematics professor at the California State University
at San Jose. A prolific author, Wulf has published
36 scientific papers on mathematics, physics,
and engineering, another 50 articles for various magazines
and newspapers, and a book. He is interested
in object-oriented methods, mathematical tools, and
"making difficult things appear simple." He was born
in a small village in northern Germany, is married,
and has two children. He enjoys writing for several
literary magazines and is a contributing editor for the
Bloomsbury Review and a contributing writer for the
Boston Book Review.

Jack D. Benzel

Jack Benzel is a member of the technical staff at the
Integrated Circuits Business Division (ICBD) in Fort
Collins, Colorado. He is responsible for structured-custom
IC design for floating-point math coprocessors for
PA-RISC processors. He joined the company in 1990.
Beginning in test process engineering in ICBD manufacturing,
he developed an off-line wafer inking system before moving
to the ICBD design lab three and a half years ago. Jack
holds a BSEE degree from Colorado State University (1985)
and was with IBM at Boulder, Colorado for four years
before joining HP. He served as a design engineer for
the voltage contrast imaging project reported in this
issue. He is married, has three young children, and
enjoys time with his family, camping, and music and
drama performance at church.

Bulent I. Dervisoglu

Before leaving HP recently to join Silicon Graphics,
Bulent Dervisoglu was a senior consulting engineer at HP's
Advanced Systems Division. He received his BSEE degree
from the Middle East Technical University in Ankara,
Turkey, in 1969 and his PhD degree in computer science
from the University of Edinburgh, Scotland, in 1973. He
taught and conducted research at several universities in
Europe and the U.S.A. before joining the Sperry Corporate
Research Center in Massachusetts in 1980. Later he
joined the MIT Lincoln Laboratories in Lexington,
Massachusetts, where his research interests included
built-in self-test and design verification. He
joined Apollo Computer in 1989 and became an HP
employee with HP's 1989 acquisition of Apollo. Bulent
has published extensively in various IEEE technical
journals and conferences and is the chairman of
the IEEE P1149.2 working group for developing a flexible
and high-performance-oriented version of boundary-scan
architecture.

Michael Ricchetti

Mike Ricchetti is a member of the consulting staff at
Cadence Design Systems where he is responsible for
design for testability (DFT) methodology and tools. He
earned a BS degree in computer engineering at Ohio
State University in 1982. Previously, he was with
Hewlett-Packard and Apollo, where he worked on design for
testability and chip test development with the Workstation
Systems Division. He was with HP-Apollo for 10 years and was
the DFT engineer for the memory controller ASIC
reported in this issue. Before joining Apollo in 1985, he
was a diagnostics engineer at Digital Equipment
Corporation. Mike is an IEEE member and participates on
the IEEE P1149.2 Testability Standards Committee
working group.